Malware Neutralization

In this blog post, I describe a novel idea for software security that I call "Malware Neutralization". At the moment, everything said here is theoretical. There is no implementation, not even a PoC. It could become a topic for further research in the future; I don't know.

Motivation

Ever downloaded something online and got "infected"? With software (legitimate or not) distributed so widely, the problem is knowing whether a given download can be trusted. Downloaded software could be infected or modified with malware components. And not only software: document files can be infected too.

How are we dealing with this issue? Easy answer: we remove files as soon as we detect them as "malware". Is this the best solution? Is there any other way to solve this issue? I propose Malware Neutralization.

Concept

So what is Malware Neutralization? The concept is easy to explain in a few lines. We remove the malware components in the binary (be it software or a document) while keeping the other components. This effectively keeps the "good" parts, so the file can still be run or opened instead of being deleted.

The steps to make this work are listed below:

Detect malicious components
Remove malicious components
Repair the binary
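
The three steps above can be chained as a small pipeline. The sketch below is only an illustration in Python: the class name `NeutralizationPipeline` and its three callables are hypothetical, and each step is left as a pluggable function because none of them has an implementation yet.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A half-open byte range [start, end) inside the file (an assumed representation).
Region = Tuple[int, int]


@dataclass
class NeutralizationPipeline:
    """Hypothetical wiring of the three steps: detect, remove, repair."""
    detect: Callable[[bytes], List[Region]]
    remove: Callable[[bytes, List[Region]], bytes]
    repair: Callable[[bytes, List[Region]], bytes]

    def run(self, data: bytes) -> bytes:
        regions = self.detect(data)
        if not regions:
            # Nothing was flagged, so the file is returned untouched.
            return data
        stripped = self.remove(data, regions)
        return self.repair(stripped, regions)
```

Keeping the steps as separate callables reflects the idea that each one can be researched and swapped independently.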

In the following sections, we will go through each of these steps and discuss its technical aspects.

Steps

Detecting malicious components

This step is easy to understand: we must be able to detect the malicious components in a given binary file. This requires a catalogue of all malware infection techniques (I shall call this malware embedding). If we do not know the techniques used for malware embedding, we cannot deploy a good detection method.

Detecting the malicious components is not simply a matter of applying YARA signatures or the usual heuristic-based detection. Those techniques are meant for quick classification of software. In this context, a quick classification is not enough: we need every malicious component to be found before we can carry on. This requirement is strict, and a fully working method may be hard to find.

I suggest using a Machine Learning or Deep Learning model to tackle this problem. I am not an AI guy, but with my limited knowledge of malware detection, I believe this is the fastest way.
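
To make the difference between classifying a whole file and locating components concrete, here is a minimal Python sketch of region-level detection. It slides a window over the bytes, asks a scoring function how suspicious each window is, and merges flagged windows into byte ranges. The function name `detect_regions`, the window sizes, and the threshold are all assumptions for illustration; `score` stands in for whatever model would actually be trained, and nothing here claims that a sliding window is the right granularity.

```python
from typing import Callable, List, Tuple

# A half-open byte range [start, end) inside the file (an assumed representation).
Region = Tuple[int, int]


def detect_regions(
    data: bytes,
    score: Callable[[bytes], float],  # hypothetical model: window -> suspicion in [0, 1]
    window: int = 64,
    stride: int = 32,
    threshold: float = 0.5,
) -> List[Region]:
    """Return merged byte ranges whose windows score at or above the threshold."""
    regions: List[Region] = []
    for start in range(0, len(data), stride):
        chunk = data[start:start + window]  # the last windows may be shorter
        if score(chunk) < threshold:
            continue
        end = start + len(chunk)
        if regions and start <= regions[-1][1]:
            # Touching or overlapping the previous flagged range: extend it.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions
```

The output is a list of locations rather than a single verdict, which is what the removal step needs as input.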

Removing malicious components

After all malicious components are found, the next step is simply to remove them from the binary. This could be done by overwriting them in place with a series of dummy bytes, e.g., 0x00. In a practical scenario, this involves direct assembly patching for executable binaries, or decoding and re-encoding for document files (OLE or zip streams).
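
As a minimal sketch of the overwrite idea, assuming the detected components are already available as byte ranges, the Python function below replaces each range with a filler byte while keeping the file length unchanged. The name `overwrite_regions` and the range representation are hypothetical, and the sketch ignores file-format details such as checksums or compressed streams.

```python
from typing import Iterable, Tuple


def overwrite_regions(
    data: bytes,
    regions: Iterable[Tuple[int, int]],  # half-open ranges [start, end)
    filler: int = 0x00,
) -> bytes:
    """Overwrite every given range with the filler byte, preserving length."""
    buf = bytearray(data)
    for start, end in regions:
        if not 0 <= start <= end <= len(buf):
            raise ValueError(f"region ({start}, {end}) is outside the file")
        buf[start:end] = bytes([filler]) * (end - start)
    return bytes(buf)
```

Keeping the length fixed means offsets elsewhere in the file still point at the same bytes. One caveat for executable code: on x86, a run of 0x00 bytes decodes as `add` instructions with a memory operand rather than as "nothing", so a filler such as 0x90 (NOP) may be a safer placeholder there.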

The hardest part of this step is probably deciding how much to remove. Without evidence, I guess that malicious components might span a large part of the binary, while the detection might only be able to discover a part of them. This might strongly affect how we approach the removal step, mostly in the degree of removal. Should a whole function be removed, or only the part that was detected? These questions are subtopics to be researched.
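
One possible answer to the "whole function or only the detected part" question can be sketched as a policy that widens each detected range to the functions it overlaps. This is only an illustration in Python: the name `widen_to_functions` is hypothetical, and the function boundaries are assumed to come from some disassembler, which is not shown.

```python
from typing import List, Tuple

# A half-open byte range [start, end) inside the file (an assumed representation).
Region = Tuple[int, int]


def widen_to_functions(detected: List[Region], functions: List[Region]) -> List[Region]:
    """Grow each detected range to cover every function it overlaps, then merge."""
    widened: List[Region] = []
    for d_start, d_end in detected:
        hits = [(s, e) for s, e in functions if s < d_end and d_start < e]
        if hits:
            start = min(d_start, min(s for s, _ in hits))
            end = max(d_end, max(e for _, e in hits))
            widened.append((start, end))
        else:
            # Outside any known function: keep the detected range as-is.
            widened.append((d_start, d_end))
    widened.sort()
    merged: List[Region] = []
    for start, end in widened:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Removing whole functions is the aggressive end of the spectrum; passing the detected ranges through unchanged is the conservative end. Which one works better is exactly the open question.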

Repairing the binary

The last step is repairing the binary. People less familiar with Program Synthesis might not know what "repairing" means here, so I give a short description of Automatic Program Repair, a subtopic of Program Synthesis.

Automated program repair is an emerging suite of technologies for automatically fixing errors or vulnerabilities—bugs, colloquially—in software systems. Automatic program repair as a research field focuses on a class of techniques that produces source code-level patches for such bugs, of the same variety that programmers produce in addressing a defect they find in their own programs or in response to a bug report. Thus, at a high level, an automatic repair approach takes as input a program and some evidence that the program has a bug (commonly, a failing test) and produces a patch for that program’s source to fix that bug, ideally without negatively influencing other correct functionality.

I found this in the introduction of the book Automatic Program Repair.

So what does that have to do with this step? Obviously, we only want to neutralize the malware components found in the binary. However, some malware embedding techniques might bind the malicious code tightly to the underlying valid components. Thus, even after all malware components are successfully removed, the binary may no longer execute. To solve this issue, I propose using Automatic Program Repair to repair the removed parts without breaking the execution flow.

Of course, this proposed method contains multiple problems that should be looked at independently in order to complete this step effectively. One of the first problems is how to build the constraints for filling the removed part. For assembly, this could mean preserving the stack and registers, but that is open to argument. Another problem is the algorithm used to decide on the correct "fixes". This could be done through logical examination of programs (Program Semantics), through Machine Learning or Deep Learning, or even through an LLM. Of course, these "algorithms" must compete on effectiveness, robustness, and speed.
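
To give a feel for what "choosing a fix under constraints" could look like, here is a small generate-and-validate sketch in Python, in the spirit of the test-driven repair described in the quote above. Every name is hypothetical: `repair_region` tries candidate fillings for one removed range, and `is_valid` stands in for whatever oracle is chosen (a test run in a sandbox, a stack and register check, a semantic check). The default padding byte 0x90 assumes x86, where it is a NOP.

```python
from typing import Callable, Iterable, Optional, Tuple


def repair_region(
    data: bytes,
    region: Tuple[int, int],           # half-open range [start, end) that was removed
    candidates: Iterable[bytes],       # candidate fillings, e.g. produced by a synthesizer
    is_valid: Callable[[bytes], bool], # hypothetical oracle deciding whether the file works
    pad: int = 0x90,                   # assumption: x86 NOP used to pad short candidates
) -> Optional[bytes]:
    """Return the first patched file accepted by the oracle, or None if none fits."""
    start, end = region
    size = end - start
    for fill in candidates:
        if len(fill) > size:
            continue  # a candidate must fit in the hole so other offsets stay fixed
        padded = fill + bytes([pad]) * (size - len(fill))
        patched = data[:start] + padded + data[end:]
        if is_valid(patched):
            return patched
    return None
```

The interesting research lives in the two inputs this sketch takes for granted: how candidates are generated, and how strong the validity oracle needs to be.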

I propose that this step use a relatively novel technology, Program Synthesis. Other technologies could also be used to recover or repair the removed parts, so that the program runs without errors caused by our removal of malicious components.

Conclusion

To recap, we defined Malware Neutralization as a process to neutralize a binary, whether it is a software executable, a document file, or any other type of file susceptible to malware embedding. We also outlined the steps of this process, together with their technical problems.

Malware Neutralization is relevant to how software and files are distributed today. Distribution of files can be unsafe due to many factors. By removing the malicious components embedded in a file (neutralization), the file can be used normally without fear of malware infection.

The proposed idea is an innovative way of keeping everyday file downloads safe while still preserving the overall content of the file to be executed or read, without having to remove the file as a whole when it is flagged as malware.

I should emphasize that this is not meant for files that contain nothing but malware functionality.

Research?

I leave my idea open to the world. Researchers interested in this problem can carry on the research following the guidelines described above (or not, you are free to explore all methods). This should be joint research across several topics, including Malware Analysis, Program Analysis, Binary Analysis, Assembly, Machine Learning, Deep Learning, Large Language Modeling, Program Synthesis, Program Semantics, Formal Assembly, and so on. It will be hard to tackle all of them at once, so I suggest tackling what you are familiar with first and solving the problems separately. After all steps are implemented, a PoC should be produced to prove the overall performance.

I may work on this problem when I have the opportunity and when it aligns with my research goals. In the near future, I might work on something else unrelated to this idea. However, I would love to hear from researchers putting my idea to the test.
