You're reading from 10 Machine Learning Blueprints You Should Know for Cybersecurity

Product type: Book

Published in: May 2023

Publisher: Packt

ISBN-13: 9781804619476

Edition: 1st Edition

Concepts

Machine Learning

Author (1)

Rajvardhan Oak

Detecting Machine-Generated Text

In the previous chapter, we discussed deepfakes: synthetic media that depict a person saying or doing things they never said or did. Powerful deep learning methods have made it possible to create deepfakes that are difficult to distinguish from real media. Machine learning models have similarly succeeded in creating fake text: text that is generated by a model but appears to be written by a human. While the technology has been used to power chatbots and develop question-answering systems, it has also been put to several nefarious uses.

Generative text models can be used to enhance bots and fake profiles on social networking sites. Given a prompt text, the model can be used to write messages, posts, and articles, thus adding credibility to the bot. A bot can now pretend to be a real person, and a victim might be fooled because of the realistic...

Technical requirements

You can find the code files for this chapter on GitHub at https://github.com/PacktPublishing/10-Machine-Learning-Blueprints-You-Should-Know-for-Cybersecurity/tree/main/Chapter%206.

Text generation models

In the previous chapter, we saw how machine learning models can be trained to generate images of people. The generated images were so realistic that, in most cases, it was nearly impossible to tell them apart from real images with the naked eye. Machine learning models have made similar progress in text generation. It is now possible to generate high-quality text automatically using deep learning models. As with images, this text is often so well written that it is hard to distinguish from human-written text.

Fundamentally, a language model is a machine learning system that looks at part of a sentence and predicts what comes next. The predicted words are appended to the existing sentence, and the newly formed sentence is used to predict what comes next. The process repeats until a special token denoting the end of the text is generated. Note that when we say that the next word...
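The predict-and-append loop described above can be sketched with a toy model. This is a minimal illustration, not the chapter's code: the bigram table, the `<bos>` and `<eos>` markers, and the function names are all invented here, and the toy conditions only on the previous word, whereas a real language model conditions on the whole text generated so far.

```python
import random

END = "<eos>"

# Invented next-word probabilities, for illustration only.
BIGRAMS = {
    "<bos>": {"the": 1.0},
    "the": {"attacker": 0.6, "bot": 0.4},
    "attacker": {"sends": 1.0},
    "bot": {"sends": 1.0},
    "sends": {"spam": 0.7, "links": 0.3},
    "spam": {END: 1.0},
    "links": {END: 1.0},
}


def next_word(prev, rng):
    """Sample the next word from the distribution for the previous word."""
    candidates = BIGRAMS[prev]
    words = list(candidates)
    weights = [candidates[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(max_len=20, seed=0):
    """Append predicted words until the end token appears or max_len is hit."""
    rng = random.Random(seed)
    sentence = ["<bos>"]
    while len(sentence) < max_len + 1:
        word = next_word(sentence[-1], rng)
        if word == END:
            break
        sentence.append(word)
    return sentence[1:]  # drop the start marker


print(" ".join(generate()))
```

The `max_len` guard is a design choice for the sketch: it stops generation even if the end token is never sampled.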

Naïve detection

In this section, we will focus on naïve methods for detecting bot-generated text. We will first create our own dataset, extract features, and then apply machine learning models to determine whether a particular text is machine-generated or not.
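As a rough idea of what feature extraction for this task can look like, the sketch below computes a few simple stylistic statistics from a piece of text. The specific features and the name `extract_features` are assumptions made for illustration; they are not necessarily the features used in the chapter.

```python
import string


def extract_features(text):
    """Compute simple, illustrative stylistic features for one document."""
    words = text.split()
    n_words = len(words)

    # Treat '.', '!' and '?' as sentence boundaries (a crude assumption).
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]

    # Strip surrounding punctuation so "posts." and "posts" match.
    cleaned = [w.strip(string.punctuation).lower() for w in words]

    return {
        "n_words": n_words,
        "avg_word_len": sum(len(w) for w in cleaned) / n_words if n_words else 0.0,
        "n_punct": sum(ch in string.punctuation for ch in text),
        "avg_sentence_len": n_words / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(set(cleaned)) / n_words if n_words else 0.0,
    }


print(extract_features("The bot posts. The bot replies!"))
```

Each document then becomes a fixed-length row of numbers, which is the form most standard classifiers expect as input.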

Creating the dataset

The task we will focus on is detecting bot-generated fake news. However, the concepts and techniques we will learn are fairly generic and can be applied to parallel tasks such as detecting bot-generated tweets, reviews, posts, and so on. As such a dataset is not readily available to the public, we will create our own.

How are we creating our dataset? We will use the News Aggregator dataset (https://archive.ics.uci.edu/ml/datasets/News+Aggregator) from the UCI Dataset Repository. The dataset contains a set of news articles (that is, links to the articles on the web). We will scrape these articles, and these are our human-generated articles. Then, we will use the article title as a prompt...
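The pairing of human-written and machine-generated articles can be sketched as follows. Everything here is hypothetical: `build_dataset`, the `title`/`text` keys, the 0/1 label convention, and `placeholder_generator` (which stands in for whatever text generation model is actually used) are assumptions for illustration, and the scraping step is omitted.

```python
def build_dataset(articles, generate_fn):
    """Pair each human-written article with a generated one.

    articles: list of dicts with 'title' and 'text' keys (assumed layout).
    generate_fn: callable that takes a prompt string and returns text.
    Labels are an assumed convention: 0 = human, 1 = machine-generated.
    """
    rows = []
    for article in articles:
        rows.append(
            {"title": article["title"], "text": article["text"], "label": 0}
        )
        rows.append(
            {
                "title": article["title"],
                "text": generate_fn(article["title"]),  # title used as prompt
                "label": 1,
            }
        )
    return rows


def placeholder_generator(prompt):
    # Stand-in for a real text generation model; it only echoes the prompt.
    return prompt + " [generated continuation would go here]"


sample = [{"title": "Markets rally", "text": "Human-written body text."}]
for row in build_dataset(sample, placeholder_generator):
    print(row["label"], row["text"])
```

Passing the generator in as a function keeps the labeling logic independent of the model, so the same code works with any generator that accepts a prompt.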

Transformer methods for detecting automated text

In the previous sections, we used traditional hand-crafted features, automated bag-of-words features, and embedding representations for text classification. In the previous chapter, we saw the power of BERT as a language model. While describing BERT, we noted that the embeddings it generates can be used for downstream classification tasks. In this section, we will extract BERT embeddings for our classification task.

The embeddings generated by BERT differ from those generated by the Word2Vec model. Recall that BERT is trained with a masked language modeling objective and uses a transformer architecture built on attention. This means that the embedding of a word depends on the context in which it occurs: based on the surrounding words, BERT determines which other words to pay attention to when generating the embedding.
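To make the idea of context dependence concrete, here is a toy comparison between a fixed lookup and a single step of dot-product self-attention. It is a sketch under heavy simplifying assumptions, not BERT: the 2-D vectors are invented, there are no learned projections, and real models stack many attention heads and layers.

```python
import math

# Invented 2-D static vectors, used only for illustration.
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def static_embedding(word, sentence):
    # The sentence is ignored: the result is the same in every context.
    return STATIC[word]


def contextual_embedding(word, sentence):
    """One step of dot-product self-attention without learned projections."""
    query = STATIC[word]
    keys = [STATIC[w] for w in sentence]

    # Similarity of the query word to every word in the sentence.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]

    # Softmax (shifted by the max score for numerical stability).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The output is the attention-weighted average of the sentence vectors.
    return [
        sum(w * key[i] for w, key in zip(weights, keys))
        for i in range(len(query))
    ]


print(static_embedding("bank", ["river", "bank"]))
print(static_embedding("bank", ["money", "bank"]))
print(contextual_embedding("bank", ["river", "bank"]))
print(contextual_embedding("bank", ["money", "bank"]))
```

The static lookup returns the same vector for "bank" in both sentences, while the attention-based version shifts toward whichever neighbor appears alongside it.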

In traditional word embeddings, a word will have the same embedding, irrespective of the context. The word...

Summary

In this chapter, we described approaches and techniques for detecting bot-generated fake news. With the rising prowess of artificial intelligence and the widespread availability of language models, attackers are using automated text generation to run bots on social media. These sock-puppet accounts can generate real-looking responses, posts, and, as we saw, even news-style articles. Data scientists in the security space, particularly those working in the social media domain, will often be up against attackers who leverage AI to spew out text and carpet-bomb a platform.

This chapter aims to equip practitioners against such adversaries. We began by understanding exactly how text generation works and created our own dataset for machine learning experiments. We then used a variety of features (hand-crafted, TF-IDF, and word embeddings) to detect bot-generated text. Finally, we used contextual embeddings to build improved detection mechanisms.

In the next chapter, we will study...

Author (1)

Rajvardhan Oak

Rajvardhan Oak is a cybersecurity expert, researcher, and scientist with a focus on machine learning solutions to security issues such as fake news, malware, and botnets. He obtained his bachelor's degree from the University of Pune, India, and his master's degree from the University of California, Berkeley. He has served on the editorial committees of multiple technical conferences and journals. His work has been featured by prominent news outlets such as WIRED magazine and the Daily Mail. In 2022, he received the ISC2 Global Achievement Award for Excellence in Cybersecurity. He is based in the Seattle area and works for Microsoft as an applied scientist in the ads fraud division.