Every once in a while we get our hands on a source code corpus of some malware (thx vx-underground!). Whether it is a quality release or not, we don’t care, because we know we usually get a kinda mixed bag of data and code – and as long as it leans towards ‘new’ and ‘quality’, we still benefit from getting access to some of this ‘bad’ code – typically written in C, C++, .NET, Go, Rust, or… AutoIt.
No matter the language of choice though, we always want to start the analysis of such a corpus by cherry-picking the low-hanging fruit first…
One way to do it is to extract all the strings referenced by the leaked source code. Some of these strings are so unique that they can form a set of perfect (unique) IOCs. It’s not surprising, then, that having a proper methodology in place to identify this sort of artifact quickly is very important – everyone loves quick, impactful wins.
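A minimal sketch of that extraction step, assuming Python and a deliberately naive regex that matches double- and single-quoted literals (real-world source will have language-specific string syntaxes and escapes this simplification ignores):

```python
import re

# Matches double- or single-quoted literals, tolerating backslash escapes.
# A simplification: multi-line strings, raw strings, and language-specific
# quoting rules are not handled.
QUOTED = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"|\'([^\'\\]*(?:\\.[^\'\\]*)*)\'')

def extract_quoted_strings(text):
    """Return all quoted string literals found in a chunk of source text."""
    return [a or b for a, b in QUOTED.findall(text)]
```

Running it over a line like `FileWrite("keylog.txt", 'data')` yields both literals; in practice you would deduplicate and sort the results across the whole corpus.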
But what is a source code file, you may ask?
Depending on the era you are from, your preferred OS, programming language, compiler, IDE… it can mean a lot of different things. Even in 2025 there are many people who still program in Visual Basic for Applications, Visual Basic Script, Perl, and even Cobol, Fortran, or not-so-old Delphi, while many more modern programmers can’t live without Go, Rust, Nim, and Python. And then some other folks still make a living off .bat, .cmd and .vbs files, despite the fact that the Windows sysadmin world pretty much endorsed PowerShell’s power and moved on from the 90s like… 10-15 years ago. And then some OGs still maintain HTA scripts, some still write multi-platform code in C, some live in assembly all day long, and some more recent coders often don’t even know what they are doing (copying and pasting ChatGPT-generated code into their consoles, hoping it can do all the magic for them). And we should not forget the files that describe installers’ inner workings, the compilation process, the linking process, and others, where scripting/coding capabilities are still present but may not be immediately apparent.
What’s constant across all the use cases listed above is that the files created by the conservative programmers of the 70s, 80s, 90s, and 2000s, and the more ‘modern’ code generated by the children of the 2010s and 2020s, almost inevitably end up being saved with a predictable file extension. Malware authors are bound by the spell of file extensions, too. Even the most conservative macOS and Linux users cannot escape this predictable behavior, and thanks to that, we can still make an attempt to build an ultimate list of file extensions that refer to programming activities in one way or another – one that covers all the modern desktop OSes. While we do that, we intentionally exclude HTML and CSS files and their derivatives: they are very ‘spammy’ in nature, and the code present in these files is (most of the time) not ‘real’ code.
Why do we pay so much attention to file extensions, you may ask? A large corpus of source code has a very peculiar problem that we need to solve: there are simply too many files to look at.
Here’s a histogram of file extensions from the repo referenced above:
================ 85088 ================
png     26631
ico     18024
dll      6070
exe      3630
bmp      2926
au3      2429
js       2205
txt      1978
html     1845
smali    1611
skn      1608
7z       1548
gif       872
svg       786
ini       650
css       608
md        603
scss      599
wav       585
xaml      500
xml       475
jpg       467
class     463
dat       411
ocx       374
asz       370
jar       362
pdb       293
ps1       269
          200
pl        178
bin       173
ttf       173
db        159
cs        153
php       150
inf       149
config    142
json      141
lng       130
zip       118
pak       114
We can easily discard media files, compiled binary objects, libraries, executables, and font files, but if we really want quick wins we need to stay very focused.
So…
We should ask: what are the file extensions that are related to programming activities in the year 2025?
It’s actually a very long list…:
It’s a long list and it’s a decent list, even if it will never be ‘final’. It covers very old programming languages, it covers many file extensions used by decades of iterations of popular programming languages, it covers both compiled and interpreted languages, it covers commercial & open-source programming projects. It covers Microsoft Office, Open/Libre Office, and Hangul Office macros, it covers Matlab, it covers configuration files, make files, it covers data files, project files, resource files, header files, definition files, localization files, etc. etc. Most of the data/code these files store is saved in a plain text format, but then of course, some store it in a compressed, encoded or otherwise non-trivial-to-extract form.
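In code, the list boils down to a set-membership filter. A sketch, assuming Python; the `CODE_EXTS` set below is a small illustrative subset I picked for the example – the real list described above is far longer:

```python
from pathlib import Path

# A small illustrative subset of the 'programming file extensions' set;
# the actual list is much longer and covers many more languages and eras.
CODE_EXTS = {
    'c', 'cpp', 'h', 'cs', 'go', 'rs', 'py', 'pl', 'php', 'js',
    'vbs', 'ps1', 'bat', 'cmd', 'au3', 'asm', 'pas', 'bas', 'hta',
}

def iter_source_files(root):
    """Yield files whose extension marks them as likely source code."""
    for p in Path(root).rglob('*'):
        if p.is_file() and p.suffix.lower().lstrip('.') in CODE_EXTS:
            yield p
```

Everything that falls outside the set – the pngs, dlls, and exes dominating the histogram – is skipped without ever being opened.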
Coming back to the topic of this post… analysing large data sets that include source code of many malware families that are made available via leaks or releases of curated collections may sometimes feel like a very mundane and counterproductive task, but the approach I want to propose here can give us tangible results very quickly.
For instance, extracting all the quoted strings from a large corpus of malicious source code files allows us to quickly identify many hardcoded file names used by the malware. These file names can then be used to quickly detect malware-related activity within EDR/XDR telemetry. Queries focused on these hardcoded artifacts will help detect both the actual infections and the unwanted activities of employees (a possible insider threat) who download such malicious repos to their corporate devices thinking this is an acceptable way of ‘analysing’ malicious data (it usually isn’t – in most cases it is an Acceptable Use Policy (AUP) violation).
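As a sketch of that hunting step, assume the EDR exposes a CSV export with a `FileName` column (both the column name and the IOC names in the usage note below are hypothetical – adapt to whatever your platform’s query language or export format actually provides):

```python
import csv

def hunt_iocs(telemetry_csv, ioc_filenames, column='FileName'):
    """Flag telemetry rows whose file name matches a hardcoded-IOC list.

    `telemetry_csv` is assumed to be an EDR export with a `FileName`
    column; the column name is an illustrative assumption, not a
    standard shared by all EDR/XDR products.
    """
    wanted = {n.lower() for n in ioc_filenames}
    with open(telemetry_csv, newline='') as f:
        return [row for row in csv.DictReader(f)
                if row.get(column, '').lower() in wanted]
```

Usage would look like `hunt_iocs('export.csv', ['keylog.txt'])`, where `keylog.txt` stands in for whichever hardcoded names the string-extraction pass actually surfaced.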
Now that we have all this administrative fluff out of the way, let’s do some quick data crunching.
After unpacking all the archives present inside the sample set referenced in the first paragraph of this post, we look at all quoted strings referenced by the source code found inside files with extensions belonging to the ‘programming file extensions’ set listed above. We then narrow our attention to the .txt file names we can extract from that string set, and finally we manually eyeball them all to quickly build a list of interesting artifacts.
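The whole pipeline – walk the tree, keep only ‘programming’ extensions, pull quoted strings, keep the ones that look like .txt file names – can be sketched as below. Assumptions: Python, a simplified quoted-literal regex, and a deliberately tiny extension set standing in for the full list:

```python
import re
from pathlib import Path

# Simplified matchers: real source code has escapes and quoting rules
# this sketch ignores, and the extension set below is only a sample.
QUOTED = re.compile(r'"([^"\\]*)"|\'([^\'\\]*)\'')
TXT_NAME = re.compile(r'^[\w .$~-]+\.txt$', re.IGNORECASE)

def txt_filenames(source_dir, code_exts=('au3', 'cs', 'js', 'ps1', 'vbs')):
    """Collect candidate hardcoded .txt file names quoted inside source files."""
    hits = set()
    for path in Path(source_dir).rglob('*'):
        if not (path.is_file() and path.suffix.lstrip('.').lower() in code_exts):
            continue
        text = path.read_text(errors='ignore')
        for a, b in QUOTED.findall(text):
            s = (a or b).strip()
            if TXT_NAME.match(s):
                hits.add(s)
    return sorted(hits)
```

The output of `txt_filenames()` is the short, deduplicated list that the manual eyeballing step then runs over.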
If we are properly prepared, it takes no more than 30 minutes to extract interesting forensic artifacts from such a large source code corpus. Another 30 to eyeball the results and… 4h to write this blog post.