You're reading from 10 Machine Learning Blueprints You Should Know for Cybersecurity

Product typeBook

Published inMay 2023

PublisherPackt

ISBN-139781804619476

Edition1st Edition

Concepts

Machine Learning

Author (1)

Rajvardhan Oak

Attributing Authorship and How to Evade It

The internet has provided the impetus to the fundamental right of freedom of expression by providing a public platform for individuals to voice their opinions, thoughts, findings, and concerns. Any person can express their views through an article, a blog post, or a video and post it online, free of charge in some cases (such as on Blogspot, Facebook, or YouTube). However, this has also led to malicious actors being able to generate misinformation, slander, libel, and abusive content freely. Authorship attribution is a task where we identify the author of a text based on the contents. Attributing authorship can help law enforcement authorities trace hate speech and threats to the perpetrator, or help social media companies detect coordinated attacks and Sybil accounts.

On the other hand, individuals may wish to remain anonymous as authors. They may want to protect their identity to avoid scrutiny or public interest. This is where authorship...

Technical requirements

You can find the code files for this chapter on GitHub at https://github.com/PacktPublishing/10-Machine-Learning-Blueprints-You-Should-Know-for-Cybersecurity/tree/main/Chapter%207.

Authorship attribution and obfuscation

In this section, we will discuss exactly what authorship attribution is and the incentives for designing attribution systems. While there are some very good reasons for doing so, there are some nefarious ones as well; we will therefore also discuss the importance of obfuscation to protect against attacks by nefarious attackers.

What is authorship attribution?

Authorship attribution is the task of identifying the author of a given text. The fundamental idea behind attribution is that different authors have different styles of writing that will reflect in the vocabulary, grammar, structure, and overall organization of the text. Attribution can be based on heuristic methods (such as similarity, common word analysis, or manual expert analysis). Recent advances in machine learning (ML) have also made it possible to build classifiers that can learn to detect the author of a given text.

Authorship attribution is not a new problem—the...

Techniques for authorship attribution

The previous section described the importance of authorship attribution and obfuscation. This section will focus on the attribution aspect—how we can design and build models to pinpoint the author of a given text.

Dataset

There has been prior research in the field of authorship attribution and obfuscation. The standard dataset for benchmarking on this task is the Brennan-Greenstadt Corpus. This dataset was collected through a survey at a university in the United States. 12 authors were recruited, and each author was required to submit a pre-written text that comprised at least 5,000 words.

A modified and improved version of this data—called the Extended Brennan-Greenstadt Corpus—was released later by the same authors. To generate this dataset, the authors conducted a large-scale survey by recruiting participants from Amazon Mechanical Turk (MTurk). MTurk is a platform that allows researchers and scientists to conduct...

Techniques for authorship obfuscation

So far, we have seen how authorship can be attributed to the writer and how to build models to detect the author. In this section, we will turn to the authorship obfuscation problem. Authorship obfuscation, as discussed in the initial section of this chapter, is the art of purposefully manipulating the text to strip it of any stylistic features that might give away the author.

The code is inspired by an implementation that is freely available online (https://github.com/asad1996172/Obfuscation-Systems) with a few minor tweaks.

First, we will import the required libraries. The most important library here is the Natural Language Toolkit (NLTK) library (https://www.nltk.org/) developed by Stanford. This library contains standard off-the-shelf implementations for several natural language processing (NLP) tasks such as tokenization, part-of-speech (POS) tagging, named entity recognition (NER), and so on. It has a powerful set of functionalities...

Summary

This chapter focused on two important problems in security and privacy. We began by discussing authorship attribution, a task of identifying who wrote a particular piece of text. We designed a series of linguistic and text-based features and trained ML models for authorship attribution. Then, we turned to authorship obfuscation, a task that aims to evade the attribution models by making changes to the text such that author-identifying characteristics and style markers are removed. We looked at a series of obfuscation methods for this. For both tasks, we looked at the improvements that could be made to the performance.

Both authorship attribution and obfuscation have important applications in cybersecurity. Attribution can be used to detect Sybil accounts, trace cybercriminals, and protect intellectual property rights. Similarly, obfuscation can help preserve the anonymity of individuals and provide privacy guarantees. This chapter enables ML practitioners in cybersecurity...

The rest of the chapter is locked

You have been reading a chapter from

10 Machine Learning Blueprints You Should Know for Cybersecurity

Published in: May 2023Publisher: PacktISBN-13: 9781804619476

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at €14.99/month. Cancel anytime

Author (1)

Rajvardhan Oak

Rajvardhan Oak is a cybersecurity expert, researcher, and scientist with a focus on machine learning solutions to security issues such as fake news, malware, and botnets. He obtained his bachelor's degree from the University of Pune, India, and his master's degree from the University of California, Berkeley. He has served on the editorial committees of multiple technical conferences and journals. His work has been featured by prominent news outlets such as WIRED magazine and the Daily Mail. In 2022, he received the ISC2 Global Achievement Award for Excellence in Cybersecurity. He is based in the Seattle area and works for Microsoft as an applied scientist in the ads fraud division.
Read more about Rajvardhan Oak

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages