You're reading from 10 Machine Learning Blueprints You Should Know for Cybersecurity

Product type Book

Published in May 2023

Publisher Packt

ISBN-13 9781804619476

Pages 330 pages

Edition 1st Edition

Concepts

Machine Learning

Author (1):

Rajvardhan Oak

Table of Contents (15) Chapters

Preface

1. Chapter 1: On Cybersecurity and Machine Learning

2. Chapter 2: Detecting Suspicious Activity

3. Chapter 3: Malware Detection Using Transformers and BERT

4. Chapter 4: Detecting Fake Reviews

5. Chapter 5: Detecting Deepfakes

6. Chapter 6: Detecting Machine-Generated Text

7. Chapter 7: Attributing Authorship and How to Evade It

8. Chapter 8: Detecting Fake News with Graph Neural Networks

9. Chapter 9: Attacking Models with Adversarial Machine Learning

10. Chapter 10: Protecting User Privacy with Differential Privacy

11. Chapter 11: Protecting User Privacy with Federated Machine Learning

12. Chapter 12: Breaking into the Sec-ML Industry

13. Index

14. Other Books You May Enjoy

Attributing Authorship and How to Evade It

The internet has strengthened the fundamental right to freedom of expression by giving individuals a public platform to voice their opinions, thoughts, findings, and concerns. Anyone can express their views through an article, a blog post, or a video and post it online, in some cases free of charge (such as on Blogspot, Facebook, or YouTube). However, this has also allowed malicious actors to freely generate misinformation, slander, libel, and abusive content. Authorship attribution is the task of identifying the author of a text based on its contents. Attributing authorship can help law enforcement authorities trace hate speech and threats to the perpetrator, or help social media companies detect coordinated attacks and Sybil accounts.

On the other hand, individuals may wish to remain anonymous as authors. They may want to protect their identity to avoid scrutiny or public interest. This is where authorship...

Technical requirements

You can find the code files for this chapter on GitHub at https://github.com/PacktPublishing/10-Machine-Learning-Blueprints-You-Should-Know-for-Cybersecurity/tree/main/Chapter%207.

Authorship attribution and obfuscation

In this section, we will discuss exactly what authorship attribution is and the incentives for designing attribution systems. While there are some very good reasons for building such systems, there are some nefarious ones as well; we will therefore also discuss the importance of obfuscation as a protection against attacks by malicious actors.

What is authorship attribution?

Authorship attribution is the task of identifying the author of a given text. The fundamental idea behind attribution is that different authors have different styles of writing, and these differences are reflected in the vocabulary, grammar, structure, and overall organization of the text. Attribution can be based on heuristic methods (such as similarity, common word analysis, or manual expert analysis). Recent advances in machine learning (ML) have also made it possible to build classifiers that can learn to detect the author of a given text.
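To make the idea of style-based attribution concrete, here is a minimal sketch of the similarity heuristic mentioned above. It is not the chapter's implementation. The function names (`style_features`, `attribute`), the short `FUNCTION_WORDS` list, and the choice of Euclidean distance are all assumptions made for illustration; a real system would use far more features and would scale them before comparing.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words (illustration only).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]


def style_features(text):
    """Map a text to a small vector of writing-style markers."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    counts = Counter(words)
    features = [
        sum(len(w) for w in words) / len(words),  # average word length
        len(words) / len(sentences),              # average sentence length (words)
        len(counts) / len(words),                 # type-token ratio
    ]
    # Relative frequency of each function word, in list order.
    features.extend(counts[w] / len(words) for w in FUNCTION_WORDS)
    return features


def attribute(text, known_texts):
    """Return the candidate author whose known writing is closest in style.

    known_texts maps an author name to a sample of that author's writing.
    Features are compared unscaled, which is an assumption of this sketch.
    """
    target = style_features(text)
    return min(
        known_texts,
        key=lambda author: math.dist(target, style_features(known_texts[author])),
    )
```

Calling `attribute(unknown_text, {"alice": alice_sample, "bob": bob_sample})` returns the name of the closest candidate. An ML classifier replaces the nearest-profile step with a learned decision rule over the same kind of feature vector.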

Authorship attribution is not a new problem—the...

Techniques for authorship attribution

The previous section described the importance of authorship attribution and obfuscation. This section will focus on the attribution aspect—how we can design and build models to pinpoint the author of a given text.

Dataset

There has been prior research in the field of authorship attribution and obfuscation. The standard dataset for benchmarking on this task is the Brennan-Greenstadt Corpus. This dataset was collected through a survey at a university in the United States. Twelve authors were recruited, and each author was required to submit pre-written text comprising at least 5,000 words.

A modified and improved version of this data—called the Extended Brennan-Greenstadt Corpus—was released later by the same authors. To generate this dataset, the authors conducted a large-scale survey by recruiting participants from Amazon Mechanical Turk (MTurk). MTurk is a platform that allows researchers and scientists to conduct...

Techniques for authorship obfuscation

So far, we have seen how a text can be attributed to its writer and how to build models that detect the author. In this section, we will turn to the authorship obfuscation problem. Authorship obfuscation, as discussed in the initial section of this chapter, is the art of purposefully manipulating a text to strip it of any stylistic features that might give away the author.

The code is inspired by an implementation that is freely available online (https://github.com/asad1996172/Obfuscation-Systems) with a few minor tweaks.

First, we will import the required libraries. The most important library here is the Natural Language Toolkit (NLTK) library (https://www.nltk.org/), originally developed at the University of Pennsylvania. This library contains standard off-the-shelf implementations for several natural language processing (NLP) tasks such as tokenization, part-of-speech (POS) tagging, named entity recognition (NER), and so on. It has a powerful set of functionalities...
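As a rough illustration of the obfuscation idea described in this section, the sketch below applies a few rule-based rewrites that flatten common style markers. It is not the chapter's code or the referenced repository's method. The lookup tables `CONTRACTIONS` and `SYNONYMS` and the function `obfuscate` are hypothetical and hand-written for this example, and the sketch uses only the Python standard library rather than NLTK.

```python
import re

# Hypothetical lookup tables, chosen only for illustration.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "I am"}
SYNONYMS = {"however": "but", "utilize": "use", "therefore": "so"}


def _match_case(original, replacement):
    """Capitalize the replacement if the original word was capitalized."""
    if original[0].isupper():
        return replacement[0].upper() + replacement[1:]
    return replacement


def obfuscate(text):
    """Apply simple rewrites that remove a few author-specific habits."""
    # 1. Drop parenthetical asides, a habit that varies between authors.
    text = re.sub(r"\s*\([^)]*\)", "", text)

    # 2. Expand contractions and swap marked words for plainer ones.
    table = {**CONTRACTIONS, **SYNONYMS}

    def swap(match):
        word = match.group(0)
        replacement = table.get(word.lower())
        return _match_case(word, replacement) if replacement else word

    return re.sub(r"[A-Za-z']+", swap, text)
```

For example, `obfuscate("However, I don't utilize it (usually).")` returns `"But, I do not use it."`. Rewrites like these change surface features that an attribution model may rely on, although a practical obfuscator would also need to check that the meaning of the text is preserved.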

Summary

This chapter focused on two important problems in security and privacy. We began by discussing authorship attribution, the task of identifying who wrote a particular piece of text. We designed a series of linguistic and text-based features and trained ML models for authorship attribution. Then, we turned to authorship obfuscation, a task that aims to evade attribution models by changing the text so that author-identifying characteristics and style markers are removed. We looked at a series of obfuscation methods for this. For both tasks, we looked at ways in which performance could be improved.

Both authorship attribution and obfuscation have important applications in cybersecurity. Attribution can be used to detect Sybil accounts, trace cybercriminals, and protect intellectual property rights. Similarly, obfuscation can help preserve the anonymity of individuals and provide privacy guarantees. This chapter enables ML practitioners in cybersecurity...