Techniques for authorship attribution
The previous section described the importance of authorship attribution and obfuscation. This section focuses on attribution: how we can design and build models to identify the author of a given text.
Dataset
Authorship attribution and obfuscation have both been studied in prior research. The standard dataset for benchmarking this task is the Brennan-Greenstadt Corpus, which was collected through a survey at a university in the United States. Twelve authors were recruited, and each was required to submit a pre-written text of at least 5,000 words.
A modified and improved version of this data, called the Extended Brennan-Greenstadt Corpus, was released later by the same authors. To build this dataset, the authors conducted a large-scale survey, recruiting participants through Amazon Mechanical Turk (MTurk). MTurk is a platform that allows researchers and scientists to conduct...