You're reading from Machine Learning for Algorithmic Trading - Second Edition
Published in Jul 2020 by Packt, 2nd Edition, ISBN-13: 9781839217715
Reading level: Intermediate
Author: Stefan Jansen

Stefan is the founder and CEO of Applied AI. He advises Fortune 500 companies, investment firms, and startups across industries on data & AI strategy, building data science teams, and developing end-to-end machine learning solutions for a broad range of business problems. Before his current venture, he was a partner and managing director at an international investment firm, where he built the predictive analytics and investment research practice. He was also a senior executive at a global fintech company with operations in 15 markets, advised Central Banks in emerging markets, and consulted for the World Bank. He holds Master's degrees in Computer Science from Georgia Tech and in Economics from Harvard and Free University Berlin, and a CFA Charter. He has worked in six languages across Europe, Asia, and the Americas and taught data science at Datacamp and General Assembly.

Text Data for Trading – Sentiment Analysis

This is the first of three chapters dedicated to extracting signals for algorithmic trading strategies from text data using natural language processing (NLP) and machine learning (ML).

Text data is rich in content but highly unstructured, so it requires extensive preprocessing before an ML algorithm can extract relevant information. A key challenge is converting text into a numerical format without losing its meaning. We will cover several techniques capable of capturing the nuances of language so that they can be used as input for ML algorithms.

In this chapter, we will introduce fundamental feature extraction techniques that focus on individual semantic units, that is, words or short groups of words called tokens. We will show how to represent documents as vectors of token counts by creating a document-term matrix and then proceed to use it as input for news classification and sentiment analysis. We will also...

ML with text data – from language to features

Text data can be extremely valuable given how much information humans communicate and store using natural language. The diverse set of data sources relevant to financial investments range from formal documents like company statements, contracts, and patents, to news, opinion, and analyst research or commentary, to various types of social media postings or messages.

Numerous and diverse text data samples are available online to explore the use of NLP algorithms, many of which are listed among the resources included in this chapter's README file on GitHub. For a comprehensive introduction, see Jurafsky and Martin (2008).

To realize the potential value of text data, we'll introduce the specialized NLP techniques and the most effective Python libraries, outline key challenges particular to working with language data, introduce critical elements of the NLP workflow, and highlight NLP applications relevant for algorithmic...

From text to tokens – the NLP pipeline

In this section, we will demonstrate how to construct an NLP pipeline using the open-source Python library spaCy. The textacy library builds on spaCy and provides easy access to spaCy attributes and additional functionality.

Refer to the notebook nlp_pipeline_with_spaCy for the following code samples, installation instructions, and additional details.

NLP pipeline with spaCy and textacy

spaCy is a widely used Python library with a comprehensive feature set for fast text processing in multiple languages. Using its tokenization and annotation engines requires installing language models. The features we will use in this chapter only require the small models; the larger models also include word vectors, which we will cover in Chapter 16, Word Embeddings for Earnings Calls and SEC Filings.

With the library installed and linked, we can instantiate a spaCy language model and then apply it to the document. The result...

Counting tokens – the document-term matrix

In this section, we first introduce how the bag-of-words model converts text data into numeric vector-space representations. The goal is to approximate document similarity by the distance between documents in that space. We then proceed to illustrate how to create a document-term matrix using the sklearn library.

The bag-of-words model

The bag-of-words model represents a document based on the frequency of the terms or tokens it contains. Each document becomes a vector with one entry for each token in the vocabulary that reflects the token's relevance to the document.

Creating the document-term matrix

The document-term matrix is straightforward to compute given the vocabulary. However, it is also a crude simplification because it abstracts from word order and grammatical relationships. Nonetheless, it often achieves good results in text classification quickly and, thus, provides a very useful starting point.

The...

NLP for trading

Once text data has been converted into numerical features using the NLP techniques discussed in the previous sections, text classification works just like any other classification task.

In this section, we will apply these preprocessing techniques to news articles, product reviews, and Twitter data and train various classifiers to predict discrete news categories, review scores, and sentiment polarity.

First, we will introduce the naive Bayes model, a probabilistic classification algorithm that works well with the text features produced by a bag-of-words model.

The code samples for this section are in the notebook news_text_classification.

The naive Bayes classifier

The naive Bayes algorithm is very popular for text classification because its low computational cost and memory requirements facilitate training on very large, high-dimensional datasets. Its predictive performance can compete with that of more complex models, it provides a good baseline, and...
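As a minimal sketch on a toy labeled corpus (the examples and labels here are invented; the notebook news_text_classification applies the same pattern to real news data), naive Bayes can be combined with bag-of-words counts in a scikit-learn pipeline:

```python
# Train MultinomialNB on bag-of-words counts using a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["stocks rallied on earnings",
              "bonds fell as yields rose",
              "equities gained after results",
              "treasury yields climbed again"]
train_labels = ["equities", "bonds", "equities", "bonds"]

# Chain vectorization and classification into a single estimator.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["earnings lifted stocks"]))
```

The pipeline object handles vectorizing unseen text with the training vocabulary before classification, which avoids leaking the vocabulary fit across splits.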

Summary

In this chapter, we explored numerous techniques and options to process unstructured data with the goal of extracting semantically meaningful numerical features for use in ML models.

We covered the basic tokenization and annotation pipeline and illustrated its implementation for multiple languages using spaCy and TextBlob. Building on these results, we used the bag-of-words model to represent documents as numerical vectors. We learned how to refine the preprocessing pipeline and then used the vectorized text data for classification and sentiment analysis.

We have two more chapters on alternative text data. In the next chapter, we will learn how to summarize texts using unsupervised learning to identify latent topics. Then, in Chapter 16, Word Embeddings for Earnings Calls and SEC Filings, we will learn how to represent words as vectors that reflect the context of word usage, a technique that has been used very successfully to provide richer...
