Microsoft and Intel have recently collaborated on a research project that explored a new approach to detecting and classifying malware.
Called STAMINA (STAtic Malware-as-Image Network Analysis), the project relies on a new technique that converts malware samples into grayscale images and then scans the image for textural and structural patterns specific to malware samples.
The Intel-Microsoft research team said the entire process followed a few simple steps. The first consisted of taking an input file and converting its binary form into a stream of raw pixel data.
Researchers then took this one-dimensional (1D) pixel stream and converted it into a 2D photo so that normal image analysis algorithms could analyze it.
The width of the image was selected based on the input file's size, using the table below. The height was dynamic, and resulted from dividing the length of the raw pixel stream by the chosen width value.
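The byte-to-image conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the function name `bytes_to_image` is hypothetical, the width is simply passed in by the caller rather than looked up from the file size, and zero-padding the final row is an assumption.

```python
def bytes_to_image(data: bytes, width: int) -> list[list[int]]:
    """Reshape a raw byte stream into rows of `width` grayscale pixels.

    Each byte (0-255) is treated as one grayscale pixel value.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Height follows from the stream length and the chosen width
    # (ceiling division, so a partial last row still gets a row).
    height = -(-len(data) // width)
    # Assumption: pad the last row with zero (black) pixels.
    padded = data.ljust(height * width, b"\x00")
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]
```

For a 10-byte input and a width of 4, this yields three rows, the last one padded with two zero pixels.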
After assembling the raw pixel stream into a normal-looking 2D image, researchers then resized the resulting photo to a smaller dimension.
The Intel and Microsoft team said that resizing the raw image did not "negatively impact the classification result," and that it was a necessary step so that computational resources would not be spent on images consisting of billions of pixels, which would most likely slow down processing.
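To illustrate the resizing step, here is a rough sketch of downscaling a grayscale image stored as a list of pixel rows. The article does not say which interpolation method the team used; nearest-neighbor sampling is an assumption chosen purely for simplicity, and the function name `resize_nearest` is hypothetical.

```python
def resize_nearest(image, new_width, new_height):
    """Resize a 2D grayscale image (list of rows) by nearest-neighbor sampling."""
    if not image or not image[0]:
        raise ValueError("image must have at least one pixel")
    old_height = len(image)
    old_width = len(image[0])
    # Map every target pixel back to a source pixel by integer scaling.
    return [
        [
            image[(y * old_height) // new_height][(x * old_width) // new_width]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]
```

Whatever the method, the point is the same: the classifier sees a fixed, small number of pixels regardless of how large the original file was.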
The resized images were then fed into a pre-trained deep neural network (DNN) that scanned the image (2D representation of the malware strain) and classified it as clean or infected.
Microsoft says it provided a sample of 2.2 million infected PE (Portable Executable) file hashes to serve as a base for the research.
Researchers used 60% of the known malware samples to train the original DNN algorithm, 20% of the files to validate the DNN, and the other 20% for the actual testing process.
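A 60/20/20 split like the one described can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_samples` is hypothetical, and shuffling with a fixed seed before splitting is an assumption, since the article does not describe how the samples were assigned to each set.

```python
import random


def split_samples(samples, seed=0):
    """Shuffle samples and split them 60% / 20% / 20% (train, validation, test)."""
    shuffled = list(samples)
    # Assumption: a seeded shuffle so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n * 60 // 100  # integer arithmetic avoids float rounding
    val_end = n * 80 // 100
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]
```

Keeping the test set separate from the training and validation sets is what allows the reported accuracy to reflect performance on files the network has not seen.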
The research team said STAMINA achieved an accuracy of 99.07% in identifying and classifying malware samples, with a false positive rate of 2.58%.
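Accuracy and false positive rate measure different things, which is why both figures are quoted. The sketch below shows how the two metrics are commonly computed from true labels and predictions (1 for malware, 0 for clean). The function name `accuracy_and_fpr` is hypothetical, and this is not the team's evaluation code.

```python
def accuracy_and_fpr(labels, predictions):
    """Return (accuracy, false positive rate) for binary labels: 1 = malware, 0 = clean."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(labels, predictions):
        if actual == 1 and predicted == 1:
            tp += 1  # malware correctly flagged
        elif actual == 0 and predicted == 0:
            tn += 1  # clean file correctly passed
        elif actual == 0 and predicted == 1:
            fp += 1  # clean file wrongly flagged
        else:
            fn += 1  # malware missed
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples to evaluate")
    accuracy = (tp + tn) / total
    # Share of clean files that were wrongly flagged as malware.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr
```

With labels `[1, 1, 0, 0]` and predictions `[1, 0, 1, 0]`, the function returns an accuracy of 0.5 and a false positive rate of 0.5.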
"The results certainly encourage the use of deep transfer learning for the purpose of malware classification," said Jugal Parikh and Marc Marino, the two Microsoft researchers who participated in the research on behalf of the Microsoft Threat Protection Intelligence Team.
The research is part of Microsoft's recent efforts to improve malware detection using machine learning techniques.
STAMINA used a technique called deep learning. Deep learning is a subset of machine learning (ML), a branch of artificial intelligence (AI), which refers to intelligent computer networks that are capable of learning on their own from input data that is stored in an unstructured or unlabeled format -- in this case, a random malware binary.
Microsoft said that while STAMINA was accurate and fast when working with smaller files, it faltered with larger ones.
"For bigger size applications, STAMINA becomes less effective due to limitations in converting billions of pixels into JPEG images and then resizing them," Microsoft said in a blog post last week.
However, this most likely doesn't matter, as the project could be used for small files only, with excellent results.
In an interview with ZDNet earlier this month, Tanmay Ganacharya, Director for Security Research of Microsoft Threat Protection, said that Microsoft now heavily relies on machine learning for detecting emerging threats, and that this system uses different machine learning modules that are being deployed on customer systems or Microsoft servers.
Microsoft now uses client-side machine learning model engines, cloud-side machine learning model engines, machine learning modules for capturing sequences of behaviors or capturing the content of the file itself, Ganacharya said.
Based on the reported results, STAMINA could very well be one of those ML modules that we may soon see implemented at Microsoft as a way to spot malware.
Currently, Microsoft can make this approach work better than other companies primarily because of the sheer volume of data it possesses from the hundreds of millions of Windows Defender installs.
"Anybody can build a model, but the labeled data and the quantity of it and the quality of it, really helps train the machine learning models appropriately and hence defines how effective they are going to be," Ganacharya said.
"And we, at Microsoft, have that as an advantage because we do have sensors that are bringing us lots of interesting signals through email, through identity, through the endpoint, and being able to combine them."