Nearly two years ago I published a quick summary of my analysis of the NSRL data. I believe I was the first to publicly evaluate this data set, and today I still stand by the harsh conclusions I reached back then. And what makes me really happy about that two-year-old analysis is the small ripple effect my posts caused…
I really loved this DFIR Science follow-up post – not only did Joshua follow my steps and deliver some nice data crunching on the NSRL core dataset to confirm or disprove my findings and hypotheses – he also did some actual benchmarking! I think the results of his experiment prove beyond any doubt that when you blindly do garbage in, you will for sure get garbage out. Also known as: you can use NSRL data better. And then Joshua published his Efficient-NSRL tool as well. So, if you use the NSRL set in your investigations, you will benefit from taking a look at my older posts, Joshua’s post, and his Efficient-NSRL tool…
Two years later…
The NSRL data set has changed a lot since 2021, so it’s only natural to come back to its recent incarnation to see what has changed…
The first notable change is that the NSRL data is now distributed as a SQLite3 database only. The schema of the database is documented in the files distributed alongside the set.
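You can also poke around the schema yourself with the sqlite3 shell’s built-in dot-commands – a quick, throwaway inspection of the two tables the recipe below relies on (using the database file name referenced later in this post):

sqlite3 RDS_2023.03.1_modern.db
.tables
.schema METADATA
.schema PACKAGE_OBJECT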
To create a textual equivalent of the old NSRLFile.txt file, one has to follow the recipe provided inside this PDF. Which, of course, doesn’t work, because the already-present FILE view (inside RDS_2023.03.1_modern.db) does not include the crc32 column/field… but we can fix that easily. We just create a new VIEW called FILE2 that includes the missing CRC32 column/field:
CREATE VIEW FILE2 AS
SELECT UPPER(md.sha256) AS sha256,
       UPPER(md.sha1)   AS sha1,
       UPPER(md.md5)    AS md5,
       UPPER(md.crc32)  AS crc32,
       CASE md.extension
         WHEN '' THEN md.file_name
         ELSE md.file_name || '.' || md.extension
       END AS file_name,
       md.bytes AS file_size,
       po.package_id
FROM METADATA AS md, PACKAGE_OBJECT AS po
WHERE md.object_id = po.object_id;
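Before running the full export, a quick sanity check of the new view doesn’t hurt (a throwaway query; the counts will obviously differ between releases):

SELECT COUNT(*) FROM FILE2;
SELECT sha1, file_name FROM FILE2 LIMIT 5;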
and then we run the export query using the FILE2 view:
DROP TABLE IF EXISTS EXPORT;
CREATE TABLE EXPORT AS
  SELECT sha1, md5, crc32, file_name, file_size, package_id FROM FILE2;
UPDATE EXPORT SET file_name = REPLACE(file_name, '"', '');
.mode csv
.headers off
.output output.txt
SELECT '"' || sha1 || '"',
       '"' || md5 || '"',
       '"' || crc32 || '"',
       '"' || file_name || '"',
       file_size,
       package_id,
       '"' || 0 || '"',
       '"' || '"'
FROM EXPORT
ORDER BY sha1;
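One caveat worth flagging (my observation, not part of the original recipe): since the SELECT already wraps values in double quotes, sqlite3’s csv output mode may quote those fields once more, doubling the embedded quotes. If your output ends up with tripled quotes, switching to list mode with a comma separator keeps the manual quoting intact:

.mode list
.separator ,
.output output.txt
SELECT '"' || sha1 || '"', '"' || md5 || '"', '"' || crc32 || '"', '"' || file_name || '"', file_size, package_id, '"' || 0 || '"', '"' || '"' FROM EXPORT ORDER BY sha1;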
or, if we just want file names:
.output filenames.txt
SELECT file_name FROM EXPORT;
These file names can then be sorted, counted, etc.
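Or, skipping the external sort/uniq step entirely, the counting can be done inside SQLite itself (a minimal sketch; the LIMIT is arbitrary):

SELECT file_name, COUNT(*) AS cnt
FROM EXPORT
GROUP BY file_name
ORDER BY cnt DESC
LIMIT 25;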
There are a lot more file names in the new set, that’s for sure. The count went from the 16,512,841 unique file names I observed in the 2021 set to 23,676,133 in the Jan 2023 one. Still, a lot of them are not that useful, because the actual benign (‘good’) source files are being pushed around, their logical chunks carved out, their sections and class files extracted, etc. – same as before, the most frequent ‘file names’ are PE file section names, MSI table names, Java files, etc… And if you missed the memo, hashes of these logical ‘chunks’ are not very useful, as you will never find their binary equivalents present on a file system. Unless your forensic suite can apply hashes to PE file sections, MSI tables, and .jar class files, all these ‘partial’ hashes are useless when it comes to ‘mark file as a good, NSRL known file’.
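If your tooling lets you pre-filter the set, a crude way to gauge (or drop) some of these pseudo-entries is to match the obvious PE section names – the list below is illustrative, not exhaustive:

SELECT COUNT(*)
FROM EXPORT
WHERE file_name IN ('.text', '.data', '.rdata', '.rsrc', '.reloc');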
The stats for the top file names are now as follows (for RDS_2023.03.1_modern.db):
We must admit that it’s hardly useful.
Having said that, you may be surprised that I still like this dataset a lot, and I would still recommend using the NSRL set in your investigations. Yes, it’s not ideal, and it may cost your forensic boxes some extra cycles, but it’s at least something. And it’s out there, for free. I also respect the effort a lot, because a few years ago I made a conscious decision to create a set competitive with NSRL, and I now know how hard it is…
The bottom line is: know and use all available data sets and tools. Just apply them wisely.