
You're reading from 10 Machine Learning Blueprints You Should Know for Cybersecurity

Product type: Book
Published in: May 2023
Publisher: Packt
ISBN-13: 9781804619476
Edition: 1st Edition
Author (1)
Rajvardhan Oak

Rajvardhan Oak is a cybersecurity expert, researcher, and scientist with a focus on machine learning solutions to security issues such as fake news, malware, and botnets. He obtained his bachelor's degree from the University of Pune, India, and his master's degree from the University of California, Berkeley. He has served on the editorial committees of multiple technical conferences and journals. His work has been featured by prominent news outlets such as WIRED magazine and the Daily Mail. In 2022, he received the ISC2 Global Achievement Award for Excellence in Cybersecurity. He is based in the Seattle area and works for Microsoft as an applied scientist in the ads fraud division.

Detecting Machine-Generated Text

In the previous chapter, we discussed deepfakes: synthetic media that depict a person saying or doing things they never said or did. Powerful deep learning methods have made it possible to create deepfakes realistic enough to be very hard to distinguish from real media. Machine learning models have likewise succeeded in creating fake text: text that is generated by a model but appears to be written by a human. While the technology has been used to power chatbots and develop question-answering systems, it has also been put to several nefarious uses.

Generative text models can be used to enhance bots and fake profiles on social networking sites. Given a text prompt, the model can write messages, posts, and articles, thus adding credibility to the bot. A bot can now pretend to be a real person, and a victim might be fooled because of the realistic...

Technical requirements

Text generation models

In the previous chapter, we saw how machine learning models can be trained to generate images of people. The generated images were so realistic that, in most cases, they could not be told apart from real images with the naked eye. Along similar lines, machine learning models have made great progress in text generation. Deep learning models can now generate high-quality text automatically. As with the images, this text is often so well written that it is hard to distinguish from human-written text.

Fundamentally, a language model is a machine learning system that looks at part of a sentence and predicts what comes next. The predicted words are appended to the existing sentence, and the extended sentence is then used to predict what comes after that. The process repeats iteratively until a special token denoting the end of the text is generated. Note that when we say that the next word...
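The predict-append-repeat loop described above can be sketched with a toy model. This is only an illustration under stated assumptions: a bigram frequency table stands in for a real neural language model, the next word is always the most frequent one (greedy decoding), and the names `END`, `train_bigram`, and `generate` are hypothetical, not taken from the book.

```python
from collections import Counter, defaultdict

END = "<eos>"  # hypothetical end-of-text token for this toy example


def train_bigram(sentences):
    """Count, for each word, which words follow it in the training sentences."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split() + [END]
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def generate(follows, prompt, max_tokens=20):
    """Greedily append the most frequent next word until END or the length cap."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        candidates = follows.get(tokens[-1])
        if not candidates:
            break  # the last word was never seen in training
        next_token = candidates.most_common(1)[0][0]
        if next_token == END:
            break  # the model predicted the end of the text
        tokens.append(next_token)  # the extended text becomes the new input
    return " ".join(tokens)


corpus = [
    "the attacker sent a phishing email",
    "the attacker sent a phishing link",
    "the attacker sent a phishing email",
]
model = train_bigram(corpus)
generated = generate(model, "the attacker")
print(generated)
```

A real language model conditions on the whole preceding context rather than only the last word, and typically samples from a probability distribution instead of always taking the top candidate, but the outer loop has the same shape.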

Naïve detection

In this section, we will focus on naïve methods for detecting bot-generated text. We will first create our own dataset, extract features, and then apply machine learning models to determine whether a particular text is machine-generated or not.
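As a rough sketch of the feature extraction step, the function below computes a handful of simple stylistic statistics for a piece of text. The specific features (word count, sentence count, average word length, punctuation count, type-token ratio) and the name `handcrafted_features` are assumptions chosen for illustration; the chapter's own feature set may differ.

```python
import string


def handcrafted_features(text):
    """Return a few simple stylistic statistics for one piece of text."""
    words = text.split()
    # Strip leading/trailing punctuation so "today." and "today" count alike.
    stripped = [w.strip(string.punctuation) for w in words]
    # Crude sentence split: treat '!' and '?' like '.', then drop empty pieces.
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    n_words = len(words)
    return {
        "n_words": n_words,
        "n_sentences": len(sentences),
        "avg_word_len": sum(len(w) for w in stripped) / n_words if n_words else 0.0,
        "n_punct": sum(ch in string.punctuation for ch in text),
        "type_token_ratio": len({w.lower() for w in stripped}) / n_words if n_words else 0.0,
    }


sample = "Markets fell sharply today. Analysts blamed rising rates!"
features = handcrafted_features(sample)
print(features)
```

Each text in the dataset would yield one such feature dictionary, and the resulting table, together with the human or machine label, is what a classifier would be trained on.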

Creating the dataset

The task we will focus on is detecting bot-generated fake news. However, the concepts and techniques we will learn are fairly generic and can be applied to parallel tasks such as detecting bot-generated tweets, reviews, posts, and so on. As such a dataset is not readily available to the public, we will create our own.

How are we creating our dataset? We will use the News Aggregator dataset (https://archive.ics.uci.edu/ml/datasets/News+Aggregator) from the UCI Machine Learning Repository. The dataset contains a set of news articles (that is, links to the articles on the web). We will scrape these articles to obtain our human-written examples. Then, we will use the article title as a prompt...
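One way the labeled dataset could be assembled is sketched below. Everything here is an assumption made for illustration: the column names `TITLE` and `URL`, the helper `build_dataset`, and the two stub functions that stand in for a real scraper and a real text generation model so that the example runs offline.

```python
import csv
import io


def build_dataset(rows, fetch_article, generate_text):
    """Pair each scraped (human) article with text generated from its title."""
    examples = []
    for row in rows:
        human_text = fetch_article(row["URL"])
        if not human_text:
            continue  # skip links that could not be scraped
        examples.append({"text": human_text, "label": 0})  # 0 = human-written
        examples.append({"text": generate_text(row["TITLE"]), "label": 1})  # 1 = machine-generated
    return examples


# Tiny inline stand-in for the tab-separated index file (assumed column names).
SAMPLE_TSV = (
    "TITLE\tURL\n"
    "Fed raises rates\thttp://example.com/a\n"
    "Dead link story\thttp://example.com/b\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))

# Hypothetical stubs: a real pipeline would download each page and call a
# text generation model with the title as the prompt.
fake_pages = {"http://example.com/a": "The Federal Reserve raised interest rates on Wednesday."}
fetch_stub = fake_pages.get
generate_stub = lambda title: f"{title}: generated continuation goes here."

dataset = build_dataset(rows, fetch_stub, generate_stub)
print(len(dataset), [example["label"] for example in dataset])
```

Skipping rows whose article cannot be fetched keeps the two classes balanced, since a generated example is only added when its human counterpart exists.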

Transformer methods for detecting automated text

In the previous sections, we used traditional hand-crafted features, automated bag-of-words features, and embedding representations for text classification. We saw the power of BERT as a language model in the previous chapter. While describing BERT, we noted that the embeddings it generates can be used for downstream classification tasks. In this section, we will extract BERT embeddings for our classification task.

The embeddings generated by BERT are different from those generated by the Word2Vec model. Recall that BERT is trained with a masked language modeling objective and uses an attention-based transformer architecture. This means that the embedding of a word depends on the context in which it occurs; based on the surrounding words, BERT determines which other words to pay attention to when generating the embedding.

In traditional word embeddings, a word will have the same embedding, irrespective of the context. The word...
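The difference between a fixed lookup and a context-dependent embedding can be illustrated with a toy dot-product attention step. This is not BERT: the two-dimensional vectors in `STATIC` are made up by hand, the example word "bank" is our own choice, and a single unparameterized attention average stands in for a full transformer.

```python
import math

# Hypothetical 2-d static embeddings, chosen by hand for illustration.
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextual_embedding(tokens, index):
    """Attention-weighted average of all token vectors, queried by tokens[index]."""
    vectors = [STATIC[t] for t in tokens]
    query = vectors[index]
    # Dot product of the query word with every word in the sentence.
    scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(query))]


bank_by_river = contextual_embedding(["river", "bank"], 1)
bank_by_money = contextual_embedding(["money", "bank"], 1)
print(STATIC["bank"], bank_by_river, bank_by_money)
```

The static lookup returns the same vector for "bank" in both sentences, whereas the attention-weighted version is pulled toward "river" in one case and toward "money" in the other.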

Summary

In this chapter, we described approaches and techniques for detecting bot-generated fake news. With the rising prowess of artificial intelligence and the widespread availability of language models, attackers are using automated text generation to run bots on social media. These sock-puppet accounts can generate real-looking responses, posts, and, as we saw, even news-style articles. Data scientists in the security space, particularly those working in the social media domain, will often be up against attackers who leverage AI to spew out text and carpet-bomb a platform.

This chapter aims to equip practitioners against such adversaries. We began by understanding exactly how text generation works and created our own dataset for machine learning experiments. We then used a variety of features (hand-crafted, TF-IDF, and word embeddings) to detect bot-generated text. Finally, we used contextual embeddings to build improved detection mechanisms.

In the next chapter, we will study...
