You're reading from 10 Machine Learning Blueprints You Should Know for Cybersecurity

Product type: Book
Published in: May 2023
Publisher: Packt
ISBN-13: 9781804619476
Edition: 1st Edition
Author: Rajvardhan Oak

Rajvardhan Oak is a cybersecurity expert, researcher, and scientist with a focus on machine learning solutions to security issues such as fake news, malware, and botnets. He obtained his bachelor's degree from the University of Pune, India, and his master's degree from the University of California, Berkeley. He has served on the editorial committees of multiple technical conferences and journals. His work has been featured by prominent news outlets such as WIRED magazine and the Daily Mail. In 2022, he received the ISC2 Global Achievement Award for Excellence in Cybersecurity. He is based in the Seattle area and works for Microsoft as an applied scientist in the ads fraud division.

Malware Detection Using Transformers and BERT

Malware refers to malicious software applications that run on computers, smartphones, and other devices for nefarious purposes. They execute surreptitiously in the background, and often, users are not even aware that their device is infected with malware. They can be used to steal sensitive user information (such as passwords or banking information) and share it with an adversary, use your device resources for cryptocurrency mining or click fraud, or corrupt your data (such as deleting photos and emails) and ask for a ransom to recover it. In the 21st century, where smartphones are our lifeline, malware can have catastrophic effects. Learning how to identify, detect, and remove malware is an important and emerging problem in cybersecurity.

Because of their ability to identify and learn patterns in behavior, machine learning techniques have been applied to detect malware. This chapter will begin with an overview of malware including its...

Technical requirements

Basics of malware

Before we learn about detecting malware, let us briefly understand what exactly malware is and how it works.

What is malware?

Malware is simply any malicious software. It will install itself on your device (such as a computer, tablet, or smartphone) and operate in the background, often without your knowledge. It is designed to quietly change files on your device, and thus steal or corrupt sensitive information. Malware is generally camouflaged and pretends to be an otherwise innocent application. For example, a browser extension that offers free emojis can actually be malware that is secretly reading your passwords and siphoning them off to a third party.

Devices can be infected by malware in multiple ways. Here are some of the popular vectors attackers exploit to deliver malware to a user device:

  • Leveraging the premise of “free” software, for example, a cracked version of an expensive product such as Adobe Photoshop
  • USB devices with the...

Malware detection

As the prevalence of malware grows, so does the need for detecting it. Routine system scans and analysis by malware detection algorithms can help users stay safe and keep their systems clean.

Malware detection methods

Malware detection can be divided broadly into three main categories: signature-based, behavior-based, and heuristic methods. In this section, we will briefly look at what these methods are and also discuss techniques for analysis.

Signature-based methods

These methods aim to detect malware by storing a database of known malware examples. All applications are checked against this database to identify whether they are malicious. The algorithm examines each application and calculates a signature using a hash function. In computer security, the hash of a file can be treated as its unique identity. It is nearly impossible to have two files with the same hash unless they are identical. Therefore, this method works really well in detecting known...
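The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the choice of SHA-256, the sample byte strings, and the names `file_signature`, `signature_db`, and `is_known_malware` are all assumptions made for the example.

```python
import hashlib


def file_signature(data: bytes) -> str:
    """Return the SHA-256 hex digest, used here as the file's signature."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical database of signatures computed from known-malicious samples.
# Real samples would be file contents read from disk; these are placeholders.
known_malware_samples = [b"malicious payload v1", b"malicious payload v2"]
signature_db = {file_signature(sample) for sample in known_malware_samples}


def is_known_malware(data: bytes) -> bool:
    """Check whether the signature of `data` appears in the database."""
    return file_signature(data) in signature_db


flagged = is_known_malware(b"malicious payload v1")
# A single extra byte yields a completely different hash, so this variant
# is not matched by the database.
variant_flagged = is_known_malware(b"malicious payload v1 ")
```

Because the check is an exact match on the hash, it only recognizes files that are byte-for-byte identical to a sample already in the database.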

Transformers and attention

Transformers are an architecture taking the machine learning world by storm, especially in the field of natural language processing. An improvement over classical recurrent neural networks (RNNs) for sequence modeling, transformers work on the principle of attention. In this section, we will discuss the attention mechanism, transformers, and the BERT architecture.
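To make the idea of attention concrete before the detailed discussion, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. It assumes the standard formulation (softmax of query-key dot products scaled by the square root of the dimension, then a weighted sum of the values); the function names and the toy vectors are illustrative only and are not taken from the chapter.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: Vector, keys: List[Vector]) -> Vector:
    """Score each key against the query and normalize the scores."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Return the attention-weighted average of the value vectors."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Toy example: two identical keys receive equal weight, so the output is
# the plain average of the two value vectors.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 2.0], [2.0, 0.0]]
weights = attention_weights(query, keys)
output = attention(query, keys, values)
```

The key property is that every position in the sequence can contribute to the output, with its contribution controlled by how well its key matches the query.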

Understanding attention

We will now take a look at attention, a recent deep learning paradigm that has made great advances in the world of natural language processing.

Sequence-to-sequence models

Most natural language tasks rely heavily on sequence-to-sequence models. While traditional methods are used for classifying a particular data point, sequence-to-sequence architectures map sequences in one domain to sequences in another. An excellent example of this is language translation. An automatic machine translator will take in sequences of tokens (sentences and words) from the source...

Detecting malware with BERT

So far, we have seen attention, transformers, and BERT. But all of it has been very specific to language-related tasks. How is all of what we have learned relevant to our task of malware detection, which has nothing to do with language? In this section, we will first discuss how we can leverage BERT for malware detection and then demonstrate an implementation of the same.

Malware as language

We saw that BERT shows excellent performance on sentence-related tasks. A sentence is merely a sequence of words, and we as humans find meaning in that sequence because we understand language. To the model, however, the tokens need not be words; they could be anything: integers, symbols, or images. In that sense, BERT is a model that performs well on sequence tasks, not only on language.

Now, imagine that instead of words, our tokens were calls made by an application. The life cycle of an application could be described as a series of API calls it makes. For instance, <START> <REQUEST-URL> <DOWNLOAD-FILE> <EXECUTE...
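Treating API calls as tokens means each call must be mapped to an integer ID before a model can consume it. The sketch below shows one simple way to do that, assuming a hand-built vocabulary; the `<PAD>` and `<UNK>` tokens, the helper names, and the fixed length of 5 are hypothetical choices for illustration, and an actual BERT pipeline would typically rely on the tokenizer that ships with the pre-trained model.

```python
from typing import Dict, List

# Hypothetical special tokens: padding and unknown-call placeholders.
PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign an integer ID to every distinct API call token."""
    vocab = {PAD: 0, UNK: 1}
    for sequence in sequences:
        for call in sequence:
            if call not in vocab:
                vocab[call] = len(vocab)
    return vocab


def encode(sequence: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map calls to IDs, truncating or padding to exactly max_len entries."""
    ids = [vocab.get(call, vocab[UNK]) for call in sequence[:max_len]]
    ids += [vocab[PAD]] * (max_len - len(ids))
    return ids


# One illustrative trace, using only the calls named in the text above.
traces = [["<START>", "<REQUEST-URL>", "<DOWNLOAD-FILE>"]]
vocab = build_vocab(traces)
encoded = encode(traces[0], vocab, max_len=5)

# A call never seen while building the vocabulary maps to <UNK>.
unseen = encode(["<START>", "<SOME-OTHER-CALL>"], vocab, max_len=3)
```

Once every trace is a fixed-length list of integers, it has the same shape as a tokenized sentence and can be fed to a sequence model in the same way.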

Summary

This chapter provided an introduction to malware and a hands-on blueprint for how it can be detected using transformers. First, we discussed the concept of malware and the various forms it comes in (rootkits, viruses, and worms). We then discussed the attention mechanism and transformer architecture, which are recent advances that have taken the machine learning world by storm. We also looked at BERT, a model that has beaten several baselines in tasks such as sentence classification and question-answering. We leveraged BERT for malware detection by fine-tuning a pre-trained model on API call sequence data.

Malware is a pressing problem that places users of phones and computers at great risk. Data scientists and machine learning practitioners who are interested in the security space need to have a strong understanding of how malware works and the architecture of models that can be used for detection. This chapter provided all of the knowledge needed and is a must to master...
