Malware Signature Generation - Mid Trial
In recent research, I've discovered a few things about malware signature generation (MSG) and the model that surrounds it. Most of this is speculation, which explains the lack of citations. However, I would like to expand on what we have and create a smarter product.
As I understand it, MSG is based on understanding exploits that have already been created, then blacklisting and whitelisting code that has already been written. The problem is that there is an effectively unlimited number of ways to complete a task, so keeping a complete list (or even a list that is up to date with the most recent hacks) is nearly impossible. With the plethora of technologies involved in a single web page request, the chance of maintaining a fully inclusive list of exploits is even slimmer.
Last week, I had a theory: if you compile source code to bytecode or binary, you could inspect the compiled output to determine whether similar plaintext code produces the same result. I tested the idea by using Rhino to compile two JavaScript files containing the same code, except that a few lines were rearranged. The function still did the same thing, but the order in which some actions took place was different. The compiled output was different. I then compiled a version with exactly the same function, but with one variable renamed. Just like before, the .class files were different. So using Rhino to compile JavaScript isn't proving to be a consistent method of identifying bytecode signatures.
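To illustrate the renamed-variable result, here is a minimal Python sketch. It does not reproduce the Rhino experiment. It hashes two small JavaScript snippets as plain text, then hashes them again after replacing every identifier with a positional placeholder. The tokenizer regex, the keyword list, and the function names `raw_fingerprint` and `normalized_fingerprint` are all assumptions made up for this example, and the tokenizer ignores string literals, comments, and most of the real JavaScript grammar.

```python
import hashlib
import re

# Deliberately incomplete keyword list (an assumption for this sketch).
JS_KEYWORDS = {"var", "let", "const", "function", "return",
               "if", "else", "for", "while", "new"}

# Very rough tokenizer: identifiers, integers, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|\S")


def raw_fingerprint(source: str) -> str:
    """Hash the source text exactly as written."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def normalized_fingerprint(source: str) -> str:
    """Hash the token stream after renaming identifiers to ID0, ID1, ..."""
    names = {}
    tokens = []
    for tok in TOKEN_RE.findall(source):
        is_identifier = re.match(r"[A-Za-z_$]", tok) and tok not in JS_KEYWORDS
        if is_identifier:
            # Same name always maps to the same placeholder,
            # numbered in order of first appearance.
            tok = names.setdefault(tok, f"ID{len(names)}")
        tokens.append(tok)
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


a = "function add(x, y) { var total = x + y; return total; }"
b = "function add(x, y) { var sum = x + y; return sum; }"

print(raw_fingerprint(a) == raw_fingerprint(b))                # byte-level comparison
print(normalized_fingerprint(a) == normalized_fingerprint(b))  # after renaming
```

This only addresses renamed variables. Rearranged statements still change the token order, so they would still produce a different fingerprint under this scheme.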
I think a less kludgy method of identifying malware would be to parse the binary, bytecode, or source code of the malicious scripts and heuristically identify what the code is trying to do. If the code is obfuscated over a few layers, makes requests for unwarranted remote resources, extends into other languages to fetch unwarranted remote resources, and/or attempts to download files to your computer without your consent or knowledge, then it would be classified as at least suspicious and marked for later review. Then, once we find that logic signature again, we can disinfect it in some fashion or notify the end-user.
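As a rough sketch of what "classified as at least suspicious" could look like, the Python below scans script source for the four behaviors listed above and records how often each appears. It uses plain pattern matching, not real parsing, so it is far weaker than the approach described. The `INDICATORS` patterns, the `Report` class, and the `scan` function are hypothetical names and rules invented for illustration, not vetted detection signatures.

```python
import re
from dataclasses import dataclass, field

# Illustrative patterns only (assumptions, not real detection rules).
INDICATORS = {
    # Common building blocks of layered obfuscation.
    "obfuscation": re.compile(r"\b(?:eval|unescape|atob)\s*\(|String\.fromCharCode"),
    # Requests for remote resources.
    "remote_request": re.compile(r"XMLHttpRequest|\bfetch\s*\(|<script[^>]+src=|<iframe", re.I),
    # Reaching into other languages or plugin technologies.
    "other_language": re.compile(r"ActiveXObject|<applet|<object|<embed", re.I),
    # Hints of a file download.
    "download": re.compile(r"\.(?:exe|scr|msi)\b|application/octet-stream", re.I),
}


@dataclass
class Report:
    # Maps indicator name to the number of matches found.
    hits: dict = field(default_factory=dict)

    @property
    def suspicious(self) -> bool:
        """Any hit at all marks the script for later review."""
        return bool(self.hits)


def scan(source: str) -> Report:
    """Count matches for each indicator and keep the non-zero ones."""
    report = Report()
    for name, pattern in INDICATORS.items():
        count = sum(1 for _ in pattern.finditer(source))
        if count:
            report.hits[name] = count
    return report


sample = ('eval(unescape("%64%6f%63")); var x = new XMLHttpRequest(); '
          'x.open("GET", "http://example.invalid/payload.exe");')
report = scan(sample)
print(report.suspicious, report.hits)
```

A flat "any hit is suspicious" rule will flag plenty of legitimate scripts, which is why the result is only a marker for later review and not a verdict.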
I will have to learn more about lexical analyzers and parsers, how they work, and what data I can see with them to understand whether this theory holds. If this pans out, I'm sure other big companies are already implementing something like it, which is why I'm not fearful about putting this idea out into the wild.
You could probably run a lot of code in a sandboxed virtual internet? Far-fetched, but just a thought.