Analysing NSRL data set for fun and because… curious, Part 2
Author: www.hexacorn.com

February 6, 2022 in Forensic Analysis, NSRL

This is the second post discussing what we can find inside the NSRL data set.

At this stage we know it contains not only file hashes, but also hashes of sections of executables and of Java .class files stored inside JAR files. Digging a bit more into the file name statistics, we find another quite substantial subset of hashes: MSI tables. They are named in a very specific way, with an exclamation mark as a prefix. There are about 350K entries like this, including:

35666 "!_StringData"
26126 "!_StringPool"
14844 "!File"
11991 "!Property"
11530 "!Error"
9987 "!Media"
9545 "!MsiFileHash"
9063 "!Feature"
9026 "!InstallExecuteSequence"
8949 "!Component"
8844 "!Registry"
8822 "!CustomAction"
8781 "!Directory"
8702 "!FeatureComponents"
7535 "!UIText"
7080 "!_Columns"
7009 "!_Tables"
6609 "!Binary"
5837 "!Control"
5540 "!AdvtExecuteSequence"
5528 "!Upgrade"
5500 "!_Validation"
5236 "!RegLocator"
5174 "!ActionText"
5085 "!InstallUISequence"
4843 "!AdminExecuteSequence"
4704 "!CreateFolder"
4638 "!Dialog"
4606 "!RadioButton"
4087 "!AppSearch"
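
A tally like the one above can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes the data is available as a CSV export with a file name column called `FileName`, and the sample rows are made up for illustration, not taken from the NSRL set.

```python
import csv
import io
from collections import Counter


def count_bang_names(csv_lines, name_column="FileName"):
    """Tally file names that start with '!' in a CSV stream.

    The column name is an assumption about the export format.
    """
    counts = Counter()
    for row in csv.DictReader(csv_lines):
        name = row.get(name_column) or ""
        if name.startswith("!"):
            counts[name] += 1
    return counts


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    '"SHA-1","FileName"\n'
    '"00","!_StringData"\n'
    '"01","!_StringData"\n'
    '"02","!File"\n'
    '"03","kernel32.dll"\n'
)

# Print the tally in "count name" order, most frequent first.
for name, count in count_bang_names(sample).most_common():
    print(count, f'"{name}"')
```

Streaming the rows through a `Counter` keeps memory use proportional to the number of distinct names rather than the number of entries, which matters for a set of this size.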

While 350K may look like a lot, it is only part of a bigger picture. If we exclude all file names that start with an exclamation mark or a dot (a bit unfair, but a good estimate), as well as all .class files, the number of entries drops by nearly 30%:

192,677,750 - all
 57,390,395 - !<filename>, .<filename>, <filename>.class
135,287,355 - excluding !<filename>, .<filename>, <filename>.class
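
The exclusion rule and the arithmetic behind these numbers can be sketched as follows. The helper names are hypothetical, and matching `.class` case-insensitively is my assumption; the post does not say how the original count treated case.

```python
def is_embedded_like(name: str) -> bool:
    """True for names matching !<filename>, .<filename> or <filename>.class.

    Case-insensitive matching of '.class' is an assumption.
    """
    return name.startswith(("!", ".")) or name.lower().endswith(".class")


def excluded_share(names) -> float:
    """Percentage of names that the rule above would exclude."""
    names = list(names)
    if not names:
        return 0.0
    excluded = sum(1 for name in names if is_embedded_like(name))
    return 100.0 * excluded / len(names)


# Totals quoted in the post.
ALL_ENTRIES = 192_677_750
EXCLUDED = 57_390_395

remaining = ALL_ENTRIES - EXCLUDED
share = 100.0 * EXCLUDED / ALL_ENTRIES
print(f"{remaining:,} entries remain, {share:.1f}% excluded")
```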

We could further narrow it down by excluding file names that start with bracketed numbers (e.g. [5]SummaryInformation) or with underscores (e.g. __DATA__la_symbol_ptr). There is also a substantial number of file names that are just numbers, numbers with media file extensions (e.g. 1494.bmp), or that are related in one way or another to executable resources (manifest.txt, version.txt, CERTIFICATE, VERSION, etc.).
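
These extra categories are easy to express as patterns. The sketch below is illustrative only: the category labels are hypothetical, the list of media extensions beyond .bmp is my guess, and the resource names are just the four examples mentioned above, so a real filter would need longer lists.

```python
import re
from typing import Optional

# Names starting with a bracketed number, e.g. [5]SummaryInformation
BRACKETED = re.compile(r"\[\d+\]")
# Names that consist of digits only
NUMERIC = re.compile(r"\d+")
# Digits plus a media extension; every extension except bmp is an assumption
NUMERIC_MEDIA = re.compile(r"\d+\.(bmp|gif|ico|jpg|png|wav)", re.IGNORECASE)
# Only the resource-related examples named in the text
RESOURCE_NAMES = {"manifest.txt", "version.txt", "CERTIFICATE", "VERSION"}


def noise_category(name: str) -> Optional[str]:
    """Return a label for names that look like embedded content, else None."""
    if BRACKETED.match(name):
        return "bracketed"
    if name.startswith("_"):
        return "underscore"
    if NUMERIC.fullmatch(name):
        return "numeric"
    if NUMERIC_MEDIA.fullmatch(name):
        return "numeric_media"
    if name in RESOURCE_NAMES:
        return "resource"
    return None


for example in ("[5]SummaryInformation", "__DATA__la_symbol_ptr", "1494.bmp"):
    print(example, "->", noise_category(example))
```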

Another area of interest is files that will always be uniquely bound to the NSRL test systems where they were generated and will never appear on other systems, e.g. files with names like <filename>.pyc, <filename>.pyo, or .gitignore.

Kudos to whoever is responsible for maintaining the NSRL set. It is an incredibly difficult task to build a list of good hashes. It's tempting to unpack, decompile, and debundle everything, and sometimes that may just generate too much noise. Unless I missed something, I hope that future versions of the set will include a flag for each entry indicating whether the file is an embedded resource or a regular file system object. Another useful field would be a parent, so that one could rebuild a tree of parent-child relationships leading back to the original ancestor file. For example, if we start with a .msi file that includes a .zip file that includes a bunch of PE files, we could trace a PE file back to the .msi, and vice versa.
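
To make the wish concrete, here is a minimal sketch of what such records could look like and how the chain could be walked. Nothing here exists in the NSRL set today: the `Entry` type, its `parent_id` and `embedded` fields, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical record with the two proposed extra fields."""
    entry_id: str
    name: str
    parent_id: Optional[str] = None  # proposed "parent" field
    embedded: bool = False           # proposed "embedded resource" flag


def ancestry(entries: Dict[str, Entry], entry_id: str) -> List[str]:
    """Walk from an entry up to its original ancestor, returning names."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = entry_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        entry = entries[current]
        chain.append(entry.name)
        current = entry.parent_id
    return chain


def children(entries: Dict[str, Entry], entry_id: str) -> List[str]:
    """Names of the entries whose direct parent is entry_id."""
    return sorted(e.name for e in entries.values() if e.parent_id == entry_id)


# Made-up example: an .msi containing a .zip containing a PE file.
entries = {
    "1": Entry("1", "setup.msi"),
    "2": Entry("2", "payload.zip", parent_id="1", embedded=True),
    "3": Entry("3", "tool.exe", parent_id="2", embedded=True),
}

print(" <- ".join(ancestry(entries, "3")))
print(children(entries, "1"))
```

A single parent pointer per entry is enough to walk in both directions, although a real data set would probably need to allow several parents, since the same embedded file can ship inside many different packages.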

But hey… what does ‘file’ even mean today? Is a .class file inside a .JAR file a file system object? Or is it a resource hidden from the file system by the packager's abstraction layer?


Source: https://www.hexacorn.com/blog/2022/02/06/analysing-nsrl-data-set-for-fun-and-because-curious-part-2/