Every once in a while we get our hands on a source code corpus of some malware (thx vx-underground!). Whether it is a quality release or not, we don’t care, because we know we usually get a kinda mixed bag of data and code – and as long as it leans towards ‘new’ and ‘quality’, we still benefit from getting access to some of this ‘bad’ code – typically written in C, C++, .NET, Go, Rust, or… AutoIt.
No matter the language of choice though, we always want to start the analysis of such a corpus by cherry-picking the low-hanging fruit first…
One way to do it is to extract all the strings referenced by the leaked source code. Some of these strings are so unique that they can form a set of perfect (unique) IOCs. It’s not surprising, then, that having a proper methodology in place to identify this sort of artifact quickly is very important – everyone loves quick, impactful wins.
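A minimal sketch of that extraction step, assuming Python and a deliberately naive regex that matches double- and single-quoted literals (real-world source will have language-specific string syntaxes and escapes this simplification ignores):

```python
import re

# Matches double- or single-quoted literals, tolerating backslash escapes.
# A simplification: multi-line strings, raw strings, and language-specific
# quoting rules are not handled.
QUOTED = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"|\'([^\'\\]*(?:\\.[^\'\\]*)*)\'')

def extract_quoted_strings(text):
    """Return all quoted string literals found in a chunk of source text."""
    return [a or b for a, b in QUOTED.findall(text)]
```

Running it over a line like `FileWrite("keylog.txt", 'data')` yields both literals; in practice you would deduplicate and sort the results across the whole corpus.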
But what is a source code file, you may ask?
Depending on the era you are from, your preferred OS, programming language, compiler, IDE… it can mean a lot of different things. Even in 2025 there are many people who still program in Visual Basic for Applications, Visual Basic Script, Perl, and even Cobol, Fortran, or not-so-old Delphi, while many more modern programmers can’t live without Go, Rust, Nim, and Python. And then some other folks still make a living off .bat, .cmd and .vbs files, despite the fact that the Windows sysadmin world pretty much endorsed PowerShell’s power and moved on from the 90s like… 10-15 years ago. And then some OGs still maintain HTA scripts, some still write multi-platform code in C, some live in assembly all day long, and some more recent coders often don’t even know what they are doing (copying and pasting ChatGPT-generated code into their consoles, hoping it can do all the magic for them). And we should not forget the files that describe installers’ inner workings, the compilation process, the linking process, and others, where scripting/coding capabilities are still present but may not be immediately apparent.
What’s constant across all the use cases listed above is that the files created by the conservative programmers of the 70s, 80s, 90s, and 2000s, and the more ‘modern’ code generated by the children of the 2010s and 2020s, almost inevitably end up being saved with a predictable file extension. Malware authors are bound by the spell of file extensions, too. Even the most conservative macOS and Linux users cannot escape this predictable behavior, and thanks to that, we can still make an attempt to build an ultimate list of file extensions that refer to programming activities in one way or another – one that covers all the modern desktop OSes. While we do that, we intentionally exclude HTML and CSS files and their derivatives: they are very ‘spammy’ in nature, and the code present in these files is (most of the time) not ‘real’ code.
Why do we pay so much attention to file extensions, you may ask? A large corpus of source code has a very peculiar problem that we need to solve: there are simply too many files to look at.
Here’s a histogram of file extensions from the repo referenced above:
================ 85088 ================
png     26631
ico     18024
dll      6070
exe      3630
bmp      2926
au3      2429
js       2205
txt      1978
html     1845
smali    1611
skn      1608
7z       1548
gif       872
svg       786
ini       650
css       608
md        603
scss      599
wav       585
xaml      500
xml       475
jpg       467
class     463
dat       411
ocx       374
asz       370
jar       362
pdb       293
ps1       269
          200
pl        178
bin       173
ttf       173
db        159
cs        153
php       150
inf       149
config    142
json      141
lng       130
zip       118
pak       114
We can easily discard media files, compiled binary objects, libraries, executables, and font files, but if we really want quick wins we need to stay very focused.
So…
We should ask: what are the file extensions that are related to programming activities in the year 2025?
It’s actually a very long list…:
It’s a long list and it’s a decent list, even if it will never be ‘final’. It covers very old programming languages, it covers many file extensions used by decades of iterations of popular programming languages, it covers both compiled and interpreted languages, it covers commercial & open-source programming projects. It covers Microsoft Office, Open/Libre Office, and Hangul Office macros, it covers Matlab, it covers configuration files, make files, it covers data files, project files, resource files, header files, definition files, localization files, etc. etc. Most of the data/code these files store is saved in a plain text format, but then of course, some store it in a compressed, encoded or otherwise non-trivial-to-extract form.
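In code, the list boils down to a set-membership filter. A sketch, assuming Python; the `CODE_EXTS` set below is a small illustrative subset I picked for the example – the real list described above is far longer:

```python
from pathlib import Path

# A small illustrative subset of the 'programming file extensions' set;
# the actual list is much longer and covers many more languages and eras.
CODE_EXTS = {
    'c', 'cpp', 'h', 'cs', 'go', 'rs', 'py', 'pl', 'php', 'js',
    'vbs', 'ps1', 'bat', 'cmd', 'au3', 'asm', 'pas', 'bas', 'hta',
}

def iter_source_files(root):
    """Yield files whose extension marks them as likely source code."""
    for p in Path(root).rglob('*'):
        if p.is_file() and p.suffix.lower().lstrip('.') in CODE_EXTS:
            yield p
```

Everything that falls outside the set – the pngs, dlls, and exes dominating the histogram – is skipped without ever being opened.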
Coming back to the topic of this post… analysing large data sets that include source code of many malware families that are made available via leaks or releases of curated collections may sometimes feel like a very mundane and counterproductive task, but the approach I want to propose here can give us tangible results very quickly.
For instance, extracting all the quoted strings from a large corpus of malicious source code files allows us to quickly identify many hardcoded file names used by the malware. These file names can then be used to quickly detect malware-related activity within EDR/XDR telemetry. Queries focused on these hardcoded artifacts will help detect both the actual infections and the unwanted activities of employees (a possible insider threat) who download such malicious repos to their corporate devices thinking this is an acceptable way of ‘analysing’ malicious data (it usually isn’t – in most cases it is an Acceptable Use Policy (AUP) violation).
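As a sketch of that hunting step, assume the EDR exposes a CSV export with a `FileName` column (both the column name and the IOC names in the usage note below are hypothetical – adapt to whatever your platform’s query language or export format actually provides):

```python
import csv

def hunt_iocs(telemetry_csv, ioc_filenames, column='FileName'):
    """Flag telemetry rows whose file name matches a hardcoded-IOC list.

    `telemetry_csv` is assumed to be an EDR export with a `FileName`
    column; the column name is an illustrative assumption, not a
    standard shared by all EDR/XDR products.
    """
    wanted = {n.lower() for n in ioc_filenames}
    with open(telemetry_csv, newline='') as f:
        return [row for row in csv.DictReader(f)
                if row.get(column, '').lower() in wanted]
```

Usage would look like `hunt_iocs('export.csv', ['keylog.txt'])`, where `keylog.txt` stands in for whichever hardcoded names the string-extraction pass actually surfaced.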
Now that we have all this administrative fluff out of the way, let’s do some quick data crunching.
After unpacking all the archives present inside the sample set referenced in the first paragraph of this post, we look at all quoted strings referenced by the source code found inside files with extensions belonging to the ‘programming file extensions’ set listed above. We then narrow our attention to the .txt file names we can extract from that string set, and finally we manually eyeball them all to quickly build a list of interesting artifacts.
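The whole pipeline – walk the tree, keep only ‘programming’ extensions, pull quoted strings, keep the ones that look like .txt file names – can be sketched as below. Assumptions: Python, a simplified quoted-literal regex, and a deliberately tiny extension set standing in for the full list:

```python
import re
from pathlib import Path

# Simplified matchers: real source code has escapes and quoting rules
# this sketch ignores, and the extension set below is only a sample.
QUOTED = re.compile(r'"([^"\\]*)"|\'([^\'\\]*)\'')
TXT_NAME = re.compile(r'^[\w .$~-]+\.txt$', re.IGNORECASE)

def txt_filenames(source_dir, code_exts=('au3', 'cs', 'js', 'ps1', 'vbs')):
    """Collect candidate hardcoded .txt file names quoted inside source files."""
    hits = set()
    for path in Path(source_dir).rglob('*'):
        if not (path.is_file() and path.suffix.lstrip('.').lower() in code_exts):
            continue
        text = path.read_text(errors='ignore')
        for a, b in QUOTED.findall(text):
            s = (a or b).strip()
            if TXT_NAME.match(s):
                hits.add(s)
    return sorted(hits)
```

The output of `txt_filenames()` is the short, deduplicated list that the manual eyeballing step then runs over.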
If we are properly prepared, it takes no more than 30 minutes to extract interesting forensic artifacts from such a large source code corpus. Another 30 to eyeball the results and… 4h to write this blog post.