Nearly two years ago I published a quick summary of my analysis of the NSRL data. I believe I was the first to publicly evaluate this data set, and today I still stand by the harsh conclusions I reached back then. And what makes me really happy about that two-year-old analysis is the small ripple effect my posts caused…
I really loved this DFIR Science follow-up post – not only did Joshua follow my steps and deliver some nice data crunching on the NSRL core dataset to confirm or disprove my findings and hypotheses – he also did some actual benchmarking! I think the results of his experiment prove beyond any doubt that when you blindly do garbage in, you will for sure get garbage out. Also known as: you can use NSRL data better. And then Joshua published his Efficient-NSRL tool as well. So, if you use the NSRL set in your investigations, you will benefit from taking a look at my older posts, Joshua’s post, and his Efficient-NSRL tool…
Two years later…
The NSRL data set has changed a lot since 2021, so it’s only natural to come back to its recent incarnation to see what has changed…
The first notable change is that the NSRL data is now distributed as a SQLite3 database only. The schema of the database is documented in the files distributed alongside the set.
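You can also poke around the schema yourself with the sqlite3 shell’s built-in dot-commands – a quick, throwaway inspection of the two tables the recipe below relies on (using the database file name referenced later in this post):

sqlite3 RDS_2023.03.1_modern.db
.tables
.schema METADATA
.schema PACKAGE_OBJECT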
To create a textual equivalent of the old NSRLFile.txt file, one has to follow the recipe provided inside this PDF. Which, of course, doesn’t work, because the already-present FILE view (inside RDS_2023.03.1_modern.db) does not include the crc32 column/field… but we can fix that easily. We just create a new VIEW called FILE2 that includes the missing CRC32 column/field:
CREATE VIEW FILE2 AS
SELECT UPPER(md.sha256) AS sha256,
       UPPER(md.sha1)   AS sha1,
       UPPER(md.md5)    AS md5,
       UPPER(md.crc32)  AS crc32,
       CASE md.extension
         WHEN '' THEN md.file_name
         ELSE md.file_name || '.' || md.extension
       END AS file_name,
       md.bytes AS file_size,
       po.package_id
FROM METADATA AS md, PACKAGE_OBJECT AS po
WHERE md.object_id = po.object_id;
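Before running the full export, a quick sanity check of the new view doesn’t hurt (a throwaway query; the counts will obviously differ between releases):

SELECT COUNT(*) FROM FILE2;
SELECT sha1, file_name FROM FILE2 LIMIT 5;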
and then we run the export query using the FILE2 view:
DROP TABLE IF EXISTS EXPORT;
CREATE TABLE EXPORT AS
  SELECT sha1, md5, crc32, file_name, file_size, package_id FROM FILE2;
UPDATE EXPORT SET file_name = REPLACE(file_name, '"', '');
.mode csv
.headers off
.output output.txt
SELECT '"' || sha1 || '"',
       '"' || md5 || '"',
       '"' || crc32 || '"',
       '"' || file_name || '"',
       file_size,
       package_id,
       '"' || 0 || '"',
       '"' || '"'
FROM EXPORT
ORDER BY sha1;
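One caveat worth flagging (my observation, not part of the original recipe): since the SELECT already wraps values in double quotes, sqlite3’s csv output mode may quote those fields once more, doubling the embedded quotes. If your output ends up with tripled quotes, switching to list mode with a comma separator keeps the manual quoting intact:

.mode list
.separator ,
.output output.txt
SELECT '"' || sha1 || '"', '"' || md5 || '"', '"' || crc32 || '"', '"' || file_name || '"', file_size, package_id, '"' || 0 || '"', '"' || '"' FROM EXPORT ORDER BY sha1;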
or, if we just want file names:
.output filenames.txt
SELECT file_name FROM EXPORT;
These file names can then be sorted, counted, etc.
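Or, skipping the external sort/uniq step entirely, the counting can be done inside SQLite itself (a minimal sketch; the LIMIT is arbitrary):

SELECT file_name, COUNT(*) AS cnt
FROM EXPORT
GROUP BY file_name
ORDER BY cnt DESC
LIMIT 25;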
There are a lot more file names in the new set, that’s for sure. The count went from the 16,512,841 unique file names I observed in the 2021 set to 23,676,133 in the Jan 2023 one. Still, a lot of them are not that useful, because the actual benign (‘good’) source files are being pushed around, their logical chunks carved out, their sections and class files extracted, etc. – same as before, the most frequent ‘file names’ are PE file section names, MSI table names, Java files, etc… And if you missed the memo, hashes of these logical ‘chunks’ are not very useful, as you will never find their binary equivalents present on a file system. Unless your forensic suite can apply hashes to PE file sections, MSI tables, and .jar class files, all these ‘partial’ hashes are useless when it comes to ‘mark file as a good, NSRL known file’.
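If your tooling lets you pre-filter the set, a crude way to gauge (or drop) some of these pseudo-entries is to match the obvious PE section names – the list below is illustrative, not exhaustive:

SELECT COUNT(*)
FROM EXPORT
WHERE file_name IN ('.text', '.data', '.rdata', '.rsrc', '.reloc');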
The stats for the top file names are now as follows (for RDS_2023.03.1_modern.db):
We must admit that it’s hardly useful.
Having said that, you may be surprised that I still like this dataset a lot, and I would still recommend using the NSRL set in your investigations. Yes, it’s not ideal, and it may cost your forensic boxes some extra cycles, but it’s at least something. And it’s out there, for free. I also respect the effort a lot, because a few years ago I made a conscious decision to create a set competitive with NSRL, and I now know how hard it is…
The bottom line is: know and use all available data sets and tools. Just apply them wisely.