February 4, 2022 in Forensic Analysis, NSRL
Last year I took a very quick look at the NSRL hash set. Since it is the de facto gold standard of good hashes, I was curious what sort of data it actually includes. There is no better way to find out than to look at it, so I downloaded the files and started some basic analysis.
The NSRLFile.txt.zip archive contains a 26 GB file, NSRLFile.txt, which includes the following number of entries:
192677750 NSRLFile.txt
Yay, that’s a lot!
Now that we know how many entries we have in it, let’s try to see what sort of file extensions we can find inside the file. After parsing the data set I came up with the following results (top 30):
class 44776614
<no ext> 36040752
png 8968329
manifest 6258021
js 4846976
foliage 4513590
py 3938119
dll 3658422
java 3263630
nasl 3256040
html 3068834
h 2671295
xml 2582138
cat 2355352
htm 2197584
mum 2164820
txt 1906815
uasset 1859095
dat 1779063
o 1749951
c 1697767
gz 1691747
mui 1377019
svg 1018434
properties 1004380
mo 1000680
gif 973653
ogg 842367
dds 822591
upk 772181
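A count like the one above can be produced in a few lines of Python. This is only a minimal sketch, not the script behind the numbers above. It assumes NSRLFile.txt is a CSV file with a header row containing a FileName column (the usual RDS layout, but check your copy), and the rule for what counts as "<no ext>" is my own guess. The sample rows are made up.

```python
import csv
import io
from collections import Counter


def extension_of(file_name):
    """Return the lower-cased extension, or '<no ext>' when there is none."""
    # Keep only the last path component, whichever separator was used.
    base = file_name.replace("\\", "/").rsplit("/", 1)[-1]
    stem, dot, ext = base.rpartition(".")
    # Assumption: no dot, a trailing dot, or a leading-dot name such as
    # ".text" all count as "no extension".
    if not dot or not stem or not ext:
        return "<no ext>"
    return ext.lower()


def count_extensions(lines):
    """Count extensions over an iterable of CSV lines that starts with a header."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[extension_of(row["FileName"])] += 1
    return counts


# Made-up sample in the assumed layout (not real NSRL records).
SAMPLE = (
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"\n'
    '"A1","B1","C1","Foo.class",10,1,"X",""\n'
    '"A2","B2","C2","Bar.CLASS",11,1,"X",""\n'
    '"A3","B3","C3","Makefile",12,1,"X",""\n'
    '"A4","B4","C4",".text",13,1,"X",""\n'
    '"A5","B5","C5","logo.png",14,1,"X",""\n'
)

if __name__ == "__main__":
    for ext, n in count_extensions(io.StringIO(SAMPLE)).most_common(30):
        print(ext, n)
```

For the real file, pass an open file handle instead of the sample, for example open("NSRLFile.txt", newline="", encoding="utf-8", errors="replace"). The encoding is an assumption too. The reader streams row by row, so the 26 GB file never has to fit in memory.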
This is a very interesting statistic: it tells us that a substantial number of records relate to .class files, i.e. compiled Java files. This is most likely the result of automated processing in which every single .JAR file is unzipped and every embedded .class file is accounted for. There are tons of non-executables as well (media files, source code, etc.). They make up the majority, really.
Looking specifically for executable / installer files we quickly realize that these are indeed quite scarce:
dll 3658422
o 1749951
bin 557524
so 504442
exe 346292
sys 98390
msi 24763
bat 31482
ps1 13433
cmd 10931
scr 8567
drv 5350
File extensions tell us roughly what file types we deal with, but another great way to look at the NSRL set is statistical analysis of its file names. A quick & dirty histogram I came up with looks like this (top 30):
1221091 ".text"
744897 "1"
719540 "text"
641856 ".reloc"
630123 "__bitcode"
579898 "version.txt"
530652 ".data"
457126 "__compact_unwind"
393021 "__gcc_except_tab"
387403 "__eh_frame"
312569 ".rodata"
312238 "CERTIFICATE"
265511 "__init.py"
235670 ".rdata"
230514 "Makefile"
214357 ".dynamic"
212477 ".dynsym"
208787 "pathname"
208787 "asset.meta"
208594 ".dynstr"
201074 ".eh_frame"
198758 ".symtab"
194931 ".init"
194760 ".note.gnu.build-id"
194215 ".strtab"
185339 "__const"
184099 "English.dat"
182945 ".gnu.version"
176370 "Master.js"
175495 ".gnu.version_r"
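A histogram of this kind is easy to sketch as well. Again, this is an illustration rather than the exact script used here: it assumes a CSV layout with a FileName column, counts names exactly as they appear in that column, and runs on made-up sample rows.

```python
import csv
import io
from collections import Counter


def name_histogram(lines, top=30):
    """Return the `top` most frequent FileName values as (name, count) pairs."""
    counts = Counter(row["FileName"] for row in csv.DictReader(lines))
    return counts.most_common(top)


# Made-up sample in the assumed layout (not real NSRL records).
SAMPLE = (
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"\n'
    '"A1","B1","C1",".text",10,1,"X",""\n'
    '"A2","B2","C2","version.txt",11,1,"X",""\n'
    '"A3","B3","C3",".text",12,1,"X",""\n'
    '"A4","B4","C4","Makefile",13,1,"X",""\n'
    '"A5","B5","C5",".text",14,1,"X",""\n'
    '"A6","B6","C6","version.txt",15,1,"X",""\n'
)

if __name__ == "__main__":
    # Print in the same "count name" shape as the listing above.
    for name, n in name_histogram(io.StringIO(SAMPLE)):
        print(f'{n} "{name}"')
```

Keep in mind that a Counter over a couple of hundred million names needs a fair amount of RAM if most names are unique, so on the full file a sort-and-count pass on disk may be the kinder option.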
This is yet another interesting statistic: what we believed to be just a large set of file hashes is actually a mix of file hashes and hashes of sections of executable files. Section hashes are useful in malware analysis, but not so much in the forensic time-saving exercise that relies on excluding files with known hashes. In other words, if you blindly apply the NSRL hash set to your forensic images, you spend CPU cycles on comparisons that, by the very nature of these records, are wasted. The only set applied to a file system should be hashes of actual files, not of their chunks.
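One crude way to act on this is to drop records whose name is a well-known section name before building the exclusion set. The sketch below is only a heuristic and rests on several assumptions: the deny list is taken from the dot- and underscore-prefixed names in the histogram above and is far from complete, the FileName column name comes from the usual CSV layout, and the sample rows are made up.

```python
import csv
import io

# Assumption: names copied from the histogram above that look like
# PE / ELF / Mach-O section names. Extend or trim to taste.
SECTION_NAMES = frozenset({
    ".text", ".reloc", ".data", ".rodata", ".rdata", ".dynamic", ".dynsym",
    ".dynstr", ".eh_frame", ".symtab", ".init", ".note.gnu.build-id",
    ".strtab", ".gnu.version", ".gnu.version_r",
    "__bitcode", "__compact_unwind", "__gcc_except_tab", "__eh_frame",
    "__const",
})


def looks_like_section(file_name):
    """True when the record name is on the section-name deny list."""
    return file_name in SECTION_NAMES


def file_level_rows(lines):
    """Yield only the CSV rows that do not look like section records."""
    for row in csv.DictReader(lines):
        if not looks_like_section(row["FileName"]):
            yield row


# Made-up sample in the assumed layout (not real NSRL records).
SAMPLE = (
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"\n'
    '"A1","B1","C1",".text",10,1,"X",""\n'
    '"A2","B2","C2","kernel32.dll",11,1,"X",""\n'
    '"A3","B3","C3","__bitcode",12,1,"X",""\n'
    '"A4","B4","C4","Makefile",13,1,"X",""\n'
)

if __name__ == "__main__":
    for row in file_level_rows(io.StringIO(SAMPLE)):
        print(row["SHA-1"], row["FileName"])
```

An exact-match deny list is deliberately conservative: it will not throw away a legitimate dotfile just because it starts with a dot, at the price of missing section names that are not on the list.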
The third lesson is pretty obvious: if you plan on using the NSRL hash set, it's best to have different sets for different operating systems, and for that we can leverage the OpSystemCode field (use an OS-specific subset for the target image). Oh, but not so fast. There are 1300+ OS versions listed in the NSRLOS.txt file, including Amstrad, MSDOS, Novell Netware, NextStep, AIX and many other kinda derelict platforms. If you plan on using NSRL, it's probably good to simply exclude files belonging to these ancient systems first.
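Building an OS-specific subset could look roughly like this. It is a sketch under assumptions: NSRLOS.txt is taken to be a CSV file with OpSystemCode and OpSystemName columns, NSRLFile.txt to carry SHA-1 and OpSystemCode columns, and the codes and names in the sample data are invented for the example.

```python
import csv
import io


def os_codes_matching(os_lines, keyword):
    """Return the OpSystemCode values whose OS name contains `keyword`."""
    needle = keyword.lower()
    return {
        row["OpSystemCode"]
        for row in csv.DictReader(os_lines)
        if needle in row["OpSystemName"].lower()
    }


def sha1_subset(file_lines, wanted_codes):
    """Return the SHA-1 values of records tagged with one of `wanted_codes`."""
    return {
        row["SHA-1"]
        for row in csv.DictReader(file_lines)
        if row["OpSystemCode"] in wanted_codes
    }


# Made-up samples in the assumed layouts (codes and names are invented).
OS_SAMPLE = (
    '"OpSystemCode","OpSystemName","OpSystemVersion","MfgCode"\n'
    '"100","Windows 10","10","M1"\n'
    '"200","Amstrad CPC","1","M2"\n'
    '"300","Windows 7","7","M1"\n'
)
FILE_SAMPLE = (
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"\n'
    '"A1","B1","C1","kernel32.dll",10,1,"100",""\n'
    '"A2","B2","C2","game.bin",11,2,"200",""\n'
    '"A3","B3","C3","ntdll.dll",12,3,"300",""\n'
    '"A4","B4","C4","user32.dll",13,1,"100",""\n'
)

if __name__ == "__main__":
    codes = os_codes_matching(io.StringIO(OS_SAMPLE), "windows")
    print(sorted(sha1_subset(io.StringIO(FILE_SAMPLE), codes)))
```

Matching on a keyword in the OS name is the laziest possible selector. With 1300+ entries in the OS list, it is worth eyeballing the matched codes before trusting the subset.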
Let’s be honest: there is a lot of value in the NSRL hash set, but there is always more than just one “but”, and I have listed a few. This is not to discourage using it, but to encourage being a bit choosier about how we use it, and to selectively cherry-pick subsets of its data for time-sensitive analysis.