You're reading from 10 Machine Learning Blueprints You Should Know for Cybersecurity

Product type: Book
Published in: May 2023
Publisher: Packt
ISBN-13: 9781804619476
Edition: 1st Edition
Author (1)
Rajvardhan Oak

Rajvardhan Oak is a cybersecurity expert, researcher, and scientist with a focus on machine learning solutions to security issues such as fake news, malware, and botnets. He obtained his bachelor's degree from the University of Pune, India, and his master's degree from the University of California, Berkeley. He has served on the editorial committees of multiple technical conferences and journals. His work has been featured by prominent news outlets such as WIRED magazine and the Daily Mail. In 2022, he received the ISC2 Global Achievement Award for Excellence in Cybersecurity. He is based in the Seattle area and works for Microsoft as an applied scientist in the ads fraud division.

Attributing Authorship and How to Evade It

The internet has strengthened the fundamental right to freedom of expression by giving individuals a public platform to voice their opinions, thoughts, findings, and concerns. Anyone can express their views through an article, a blog post, or a video and post it online, in some cases free of charge (such as on Blogspot, Facebook, or YouTube). However, this has also allowed malicious actors to freely generate misinformation, slander, libel, and abusive content. Authorship attribution is the task of identifying the author of a text based on its contents. Attributing authorship can help law enforcement authorities trace hate speech and threats to the perpetrator, or help social media companies detect coordinated attacks and Sybil accounts.

On the other hand, individuals may wish to remain anonymous as authors. They may want to protect their identity to avoid scrutiny or public interest. This is where authorship...

Technical requirements

Authorship attribution and obfuscation

In this section, we will discuss exactly what authorship attribution is and the incentives for designing attribution systems. While there are some very good reasons for building such systems, there are nefarious ones as well; we will therefore also discuss the importance of obfuscation as a defense against attackers who misuse attribution.

What is authorship attribution?

Authorship attribution is the task of identifying the author of a given text. The fundamental idea behind attribution is that different authors have different styles of writing that are reflected in the vocabulary, grammar, structure, and overall organization of the text. Attribution can be based on heuristic methods (such as similarity measures, common word analysis, or manual expert analysis). Recent advances in machine learning (ML) have also made it possible to build classifiers that learn to identify the author of a given text.
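To make the similarity-based heuristic concrete, here is a minimal sketch in Python. It is an illustration only, not the method used later in the chapter: the short `FUNCTION_WORDS` list and the `style_profile`, `cosine`, and `attribute` helpers are hypothetical names chosen for this example, and it assumes that function-word frequencies are a reasonable stand-in for writing style.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words (an assumption for
# illustration; practical systems typically use far larger feature sets).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "is", "it", "but", "i", "we"]


def style_profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(text, known_texts):
    """Return the candidate author whose style profile is closest to the text."""
    target = style_profile(text)
    return max(
        known_texts,
        key=lambda author: cosine(target, style_profile(known_texts[author])),
    )


# Toy writing samples, invented for this example.
known = {
    "alice": "I think that it is the best of the options, "
             "but I am not sure that it is safe.",
    "bob": "We went to the market and to the park and to the station in a hurry.",
}
query = "We walked to the river and to the bridge and we stopped in a field."
print(attribute(query, known))
```

With these toy samples the query is matched to `bob`, whose heavy use of "to", "and", and "we" resembles the query's. A classifier-based approach would replace the nearest-profile rule with a model trained on many such feature vectors per author.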

Authorship attribution is not a new problem—the...

Techniques for authorship attribution

The previous section described the importance of authorship attribution and obfuscation. This section will focus on the attribution aspect—how we can design and build models to pinpoint the author of a given text.

Dataset

There is a body of prior research on authorship attribution and obfuscation. The standard dataset for benchmarking on this task is the Brennan-Greenstadt Corpus. This dataset was collected through a survey at a university in the United States. Twelve authors were recruited, and each author was required to submit pre-written text comprising at least 5,000 words.

A modified and improved version of this data—called the Extended Brennan-Greenstadt Corpus—was released later by the same authors. To generate this dataset, the authors conducted a large-scale survey by recruiting participants from Amazon Mechanical Turk (MTurk). MTurk is a platform that allows researchers and scientists to conduct...

Techniques for authorship obfuscation

So far, we have seen how a text can be attributed to its writer and how to build models that identify the author. In this section, we will turn to the authorship obfuscation problem. Authorship obfuscation, as discussed in the initial section of this chapter, is the art of purposefully manipulating the text to strip it of any stylistic features that might give away the author.

The code in this section is inspired by an implementation that is freely available online (https://github.com/asad1996172/Obfuscation-Systems), with a few minor tweaks.
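Before turning to that implementation, the core idea of obfuscation can be illustrated with a much simpler, self-contained sketch. This is neither the book's code nor the linked repository's: the `SUBSTITUTIONS` table and the `obfuscate` function are hypothetical, and the sketch assumes that swapping a few marked words and flattening punctuation habits is enough to show the principle.

```python
import re

# Hypothetical substitution table (an assumption for illustration): each key is
# a word that might act as a stylistic marker, mapped to a plainer alternative.
SUBSTITUTIONS = {
    "however": "but",
    "utilize": "use",
    "therefore": "so",
    "don't": "do not",
    "can't": "cannot",
}


def obfuscate(text):
    """Replace marked words and collapse repeated '!' or '?' characters."""

    def swap(match):
        word = match.group(0)
        replacement = SUBSTITUTIONS.get(word.lower(), word)
        # Keep an initial capital so sentence starts still read naturally.
        if word[0].isupper():
            replacement = replacement[0].upper() + replacement[1:]
        return replacement

    text = re.sub(r"[A-Za-z']+", swap, text)
    # Runs of the same mark ("!!!", "??") are a common stylistic tell.
    return re.sub(r"([!?])\1+", r"\1", text)


print(obfuscate("However, I don't utilize it!!!"))
```

The call above prints `But, I do not use it!`. A rule table like this only removes the markers it knows about, so it is best read as a demonstration of the goal and not as a defense against a trained attribution model.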

First, we will import the required libraries. The most important library here is the Natural Language Toolkit (NLTK) library (https://www.nltk.org/), which was originally developed at the University of Pennsylvania. This library contains standard off-the-shelf implementations for several natural language processing (NLP) tasks such as tokenization, part-of-speech (POS) tagging, named entity recognition (NER), and so on. It has a powerful set of functionalities...

Summary

This chapter focused on two important problems in security and privacy. We began by discussing authorship attribution, the task of identifying who wrote a particular piece of text. We designed a series of linguistic and text-based features and trained ML models for authorship attribution. Then, we turned to authorship obfuscation, a task that aims to evade attribution models by changing the text so that author-identifying characteristics and style markers are removed. We looked at a series of obfuscation methods for this. For both tasks, we looked at improvements that could be made to performance.

Both authorship attribution and obfuscation have important applications in cybersecurity. Attribution can be used to detect Sybil accounts, trace cybercriminals, and protect intellectual property rights. Similarly, obfuscation can help preserve the anonymity of individuals and provide privacy guarantees. This chapter enables ML practitioners in cybersecurity...
