Looking for the randomness in the most non-AI/ML way…

2023-11-25 08:27:57 Author: www.hexacorn.com

Here’s an old-school file name-based research… it is not game changing, it won’t bring any immediate solution, but it’s still worth doing today…

The software we install (focus here is on Windows, as usual) creates a loooot of files, and while many of them seem to be completely random, whimsical in nature, especially with regards to their file names, they do end up forming a corpus of sorts… Put differently, when bundled together, all these file names known to be created for legitimate purposes are great material for research.

For this post I collected 1.5M executable file names from Windows. This is not the full set of file names ‘out there’, but it’s enough to play around with…

I then looked at the statistics of 2-, 3- and 4-character long infixes (ignoring any characters outside [a-z]).
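The counting step can be sketched roughly like this. It is only a minimal sketch, not the script used for this post: it assumes that 'ignoring' non [a-z] characters means lowercasing the name and removing those characters before sliding a window over what is left (splitting on them instead would be an equally valid reading), and the names `normalize`, `infixes` and `infix_stats` are made up for illustration.

```python
import re
from collections import Counter

NON_LETTERS = re.compile(r"[^a-z]")


def normalize(name: str) -> str:
    """Lowercase the name and drop every character outside a-z (assumption)."""
    return NON_LETTERS.sub("", name.lower())


def infixes(name: str, n: int):
    """Yield every n-character window of the normalized name."""
    s = normalize(name)
    for i in range(len(s) - n + 1):
        yield s[i:i + n]


def infix_stats(names, n: int) -> Counter:
    """Count how often each n-character infix occurs across all names."""
    counts = Counter()
    for name in names:
        counts.update(infixes(name, n))
    return counts
```

Calling `infix_stats(names, 4).most_common()` on a list of file names would then give a frequency table similar in spirit to the stats files below. Note that under this reading the extension (‘exe’) is part of the normalized string too.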

The results are below:

How often 2-character long infixes appear in these 1.5M file names: filename_stats_2.txt – as you can see, not very useful…
How often 3-character long infixes appear in these 1.5M file names: filename_stats_3.txt – not very useful either…
How often 4-character long infixes appear in these 1.5M file names: filename_stats_4.txt – this is better… we definitely can cherry-pick a lot of 4-character long infixes that never appear in the set: filename_stats_4_non-existing.txt
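Finding the infixes that never appear is a set difference: there are only 26^4 = 456,976 possible 4-letter strings over [a-z], so we can enumerate all of them and keep the ones that were not seen. A minimal sketch, assuming `seen` is whatever collection of infixes came out of the counting step (the function name `missing_infixes` is hypothetical):

```python
from itertools import product
from string import ascii_lowercase


def missing_infixes(seen, n=4):
    """Return, in sorted order, every a-z string of length n not in `seen`."""
    seen = set(seen)
    missing = []
    # product() over a sorted alphabet yields candidates in lexicographic order
    for chars in product(ascii_lowercase, repeat=n):
        candidate = "".join(chars)
        if candidate not in seen:
            missing.append(candidate)
    return missing
```

The result is only as good as the corpus: an infix missing from 1.5M names may still be perfectly legitimate in a name that was simply not collected.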

Using the latter, we can create regex sets:

1 leading character, 3 following: filename_stats_4_non-existing_regex1-3.txt
2 leading characters, 2 following: filename_stats_4_non-existing_regex2-2.txt
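One plausible way to build such sets is to group the non-existing infixes by their leading 1 or 2 characters and turn the remaining characters into an alternation. This is a guess at the layout, not a description of the attached files; `build_prefix_regexes` and the sample infixes are made up for illustration.

```python
import re
from collections import defaultdict


def build_prefix_regexes(missing, prefix_len):
    """Group infixes by their first `prefix_len` characters, one regex per group."""
    groups = defaultdict(list)
    for infix in missing:
        groups[infix[:prefix_len]].append(infix[prefix_len:])
    # Input is a-z only, so no regex escaping is needed for the pieces.
    return {
        prefix: f"{prefix}(?:{'|'.join(sorted(suffixes))})"
        for prefix, suffixes in sorted(groups.items())
    }


# Hypothetical sample; real input would be the list of non-existing infixes.
sample = ["qxzj", "qxzk", "qjxv"]
regex_1_3 = build_prefix_regexes(sample, 1)  # 1 leading character, 3 following
regex_2_2 = build_prefix_regexes(sample, 2)  # 2 leading characters, 2 following
```

The 1-3 split gives at most 26 long regexes, the 2-2 split gives at most 676 shorter ones; which one is easier to tune is a matter of taste.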

Using these regex sets you may actually get better at finding randomly named files! You will also find a lot of false positives (FPs), of course, but now you have a set of regexes you can tune to your needs…
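Applying the regexes could look roughly like the sketch below. The two patterns are invented placeholders, not entries from the attached files, and the normalization (lowercase, strip non [a-z]) is the same assumption as used for the stats; `find_hits` is a hypothetical helper.

```python
import re


def find_hits(name, patterns):
    """Return every 'never seen' infix the patterns find in the normalized name."""
    normalized = re.sub(r"[^a-z]", "", name.lower())
    return [m.group(0) for p in patterns for m in p.finditer(normalized)]


# Placeholder patterns for illustration only.
patterns = [re.compile(p) for p in ("qx(?:zj|zk)", "zq(?:jx)")]

for name in ("setup.exe", "aqxzkzqjx.exe"):
    hits = find_hits(name, patterns)
    print(name, "suspicious" if hits else "ok", hits)
```

One obvious knob for tuning is the number of hits required before a name is flagged: a single unseen infix is weak evidence, several in one name are more interesting.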

Can this be used in ML/AI research?

Yes, by all means, but the set of file names used as a base should be a loooot larger and collected in a more meaningful way. One can argue that f.ex. temporary files created by installers could be excluded; we could also exclude file names that follow certain patterns (f.ex. starting with a dollar ‘$’ or tilde ‘~’, or conforming to a pattern ‘<GUID>.exe’); we could reduce the corpus by understanding versioned file names (f.ex. ‘FirefoxSetup63.exe’, ‘FirefoxSetup64.0.2.exe’, etc.); we could ignore non-English file names (‘Менеджер BIM Сервера GRAPHISOFT 19.exe’, ‘联系汉化作者.exe’, etc.) or artificially created file names that are used by many ‘download/update’ managers (‘ICReinstall_’ as in ‘ICReinstall_any_video_converter.exe’, ‘ICReinstall_driver identifier.exe’, etc.); or… we could also focus entirely on signed installers, or on files compiled within a certain timeframe (f.ex. the last decade).
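Some of these clean-up ideas are easy to sketch. The code below is only an illustration under a few assumptions: ‘non-English’ is approximated as ‘contains non-ASCII characters’, the GUID pattern accepts optional braces, and versions are collapsed to a ‘#’ placeholder. The names `keep_name` and `collapse_version` are hypothetical, and checks that need the file itself (signatures, compile time) are not covered.

```python
import re

# <GUID>.exe, with or without surrounding braces (assumption)
GUID_EXE = re.compile(
    r"^\{?[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\}?\.exe$", re.IGNORECASE
)
# A run of digits, optionally followed by dotted parts: 63, 64.0.2, ...
VERSION = re.compile(r"\d+(?:\.\d+)*")


def keep_name(name: str) -> bool:
    """Return False for names the corpus would arguably be better off without."""
    if name.startswith(("$", "~")):
        return False
    if GUID_EXE.match(name):
        return False
    if not name.isascii():  # crude stand-in for 'non-English'
        return False
    if name.lower().startswith("icreinstall_"):
        return False
    return True


def collapse_version(name: str) -> str:
    """Replace version numbers in the stem with '#', leaving the extension alone."""
    stem, dot, ext = name.rpartition(".")
    if not dot:
        return VERSION.sub("#", name)
    return VERSION.sub("#", stem) + dot + ext
```

With this, ‘FirefoxSetup63.exe’ and ‘FirefoxSetup64.0.2.exe’ both collapse to ‘FirefoxSetup#.exe’, so they count once instead of once per release.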

As I said… it is not game changing, it won’t bring any immediate solution, but it’s still worth doing today…

And I will now answer the ‘why’:

– just to understand how hopeless the whole file name-matching idea is!

Source: https://www.hexacorn.com/blog/2023/11/25/looking-for-the-randomness-in-the-most-non-ai-ml-way/