You're reading from Mastering Clojure Data Analysis

Product type: Book
Published in: May 2014
Reading level: Beginner
ISBN-13: 9781783284139
Edition: 1st Edition
Author: Eric Richard Rochester

Eric Richard Rochester studied medieval English literature and linguistics at UGA and wrote his dissertation on lexicography. Now he programs in Haskell and writes. He's also a husband and parent.
Chapter 10. Modeling Stock Data

Automated stock analysis has gotten a lot of press recently. High-frequency trading firms are a flashpoint: people believe either that they're great for the markets and increase liquidity, or that they're precursors to the apocalypse. Smaller traders have also gotten into the mix in a slower fashion. Some sites, such as Quantopian (https://www.quantopian.com/) and AlgoTrader (http://www.algotrader.ch/), provide services that allow you to create models for automated trading. Many others allow you to use automated analysis to inform your trading decisions.

Whatever your view of this phenomenon, it's an area with a lot of data begging to be analyzed. It's also a nice domain in which to experiment with some analysis and machine learning techniques.

For this chapter, we're going to look for relationships between news articles and future stock prices.

In the course of this chapter, we will cover the following topics:

  • Learning about financial data analysis

  • Setting up...

Learning about financial data analysis


Finance has always relied heavily on data. Earnings statements, forecasting, and portfolio management are just some of the areas that make use of data to quantify their decisions. Because of this, financial data analysis and its related field, financial engineering, are extremely broad fields that are difficult to summarize in a short amount of space.

However, lately, quantitative finance, high-frequency trading, and similar fields have gotten a lot of press and really come into their own. As I mentioned, some people hate them and the added volatility they seem to bring to the markets. Others maintain that they bring the necessary liquidity that helps the markets function better.

All of these fields apply statistical or machine learning methods to financial data. Some of these techniques can be quite simple. Others are more sophisticated. Some of these analyses are used to help a human analyst or manager make better financial decisions. Others are used...

Setting up the basics


Before we really dig into the project and the data, we need to prepare. We'll set up the project and its libraries, and then we'll download the data.

Setting up the library

First, we'll need to initialize the project. We can do this using Leiningen 2 (http://leiningen.org/) and Stuart Sierra's reloaded template (https://github.com/stuartsierra/reloaded). This will initialize the development environment and the project.

To do this, just execute the following command at the prompt (I've named the project financial in this case):

lein new reloaded financial

Now, we can specify the libraries that we'll need to use. We can do this in the project.clj file. Open it and replace its current contents with the following lines:

(defproject financial "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.xml "0.0.7"]
                 [org.clojure/data.csv "0.1.2"]
                 [clj-time "0.6.0"]
                 [me.raynes/fs "1.4.4"]
                 [org.encog/encog-core "3.1.0"]
                 [enclog "0.6.3"]]
  :profiles
  {:dev {:dependencies...

Getting prepared with data


As usual, we now need to clean up the data and put it into a shape that we can work with. The news article dataset in particular will require some attention, so let's turn to it first.

Working with news articles

The OANC (Open American National Corpus) is published in an XML format that includes a lot of information and annotations about the data. Specifically, the markup identifies:

  • Sections and chapters

  • Sentences

  • Words, with part-of-speech tags and lemmas

  • Noun chunks

  • Verb chunks

  • Named entities

However, we want the option to use raw text later when the system is actually being used. Because of that, we will ignore the annotations and just extract the raw tokens. In fact, all we're really interested in is each document's text—either as a raw string or a feature vector—and the date it was published. Let's create a record type for this.

We'll put this into the types.clj file in src/financial/. Put this simple namespace header into the file:

(ns financial.types)

This data record will be similarly simple. It can...
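A minimal sketch of such a record, assuming we keep only the publication date and the raw text (the field names here are illustrative, not necessarily the chapter's):

```clojure
;; A minimal sketch: one record per article, holding only the
;; publication date and the raw text (or, later, a feature vector).
;; Field names are illustrative assumptions.
(defrecord NewsArticle [pub-date text])
```

An article can then be built with `(->NewsArticle date raw-text)` and its fields read with ordinary keyword access.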

Analyzing the text


Our goal for analyzing the news articles is to generate a vector space model of the collection of documents. This attempts to pull the salient features for the documents into a vector of floating-point numbers. Features can be words or information from the documents' metadata encoded for the vector. The feature values can be 0 or 1 for presence, an integer for raw frequency, or the frequency scaled in some form.

In our case, we'll use the feature vector to represent a selection of the tokens in a document. Often, we can use all the tokens, or all the tokens that occur more than once or twice. However, in this case, we don't have a lot of data, so we'll need to be more selective in the features that we include. We'll consider how we select these in a few sections.

For the feature values, we'll use a scaled version of the token frequency called term frequency-inverse document frequency (tf-idf). There are good libraries for this, but this is a basic metric in working with...
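As a rough sketch of the metric itself (these helper functions are illustrative, not the chapter's actual implementation): tf-idf multiplies a term's count in a document by the logarithm of the total number of documents divided by the number of documents containing that term, so terms that appear everywhere get weighted down.

```clojure
;; Illustrative tf-idf helpers, not the chapter's implementation.
;; freqs:  a map of token -> count for one document.
;; corpus: a seq of such frequency maps, one per document.
(defn tf [freqs term]
  (get freqs term 0))

(defn idf [corpus term]
  (let [n  (count corpus)
        df (count (filter #(contains? % term) corpus))]
    ;; inc guards against terms that appear in no document
    (Math/log (/ n (inc df)))))

(defn tf-idf [corpus freqs term]
  (* (tf freqs term) (idf corpus term)))
```

The `inc` in the denominator is one common smoothing choice; other tf-idf variants scale or normalize the terms differently.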

Inspecting the stock prices


Now that we have some hold on the textual data, let's turn our attention to the stock prices. Previously, we loaded them from the CSV file using the financial.csv-data/read-stock-prices function. Let's reload that data with the following commands:

user=> (def stock (csvd/read-stock-prices "d/d-1995-2001.csv"))
user=> (count stock)
1263
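For reference, a minimal version of such a CSV reader could look like the following sketch using clojure.data.csv, which is already in our dependencies (the keywordized header and string values are assumptions; the chapter's actual read-stock-prices may differ, for instance by parsing dates and prices into proper types):

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; Sketch: read a CSV with a header row into a seq of maps keyed
;; by keywordized column names. Values stay as strings here; a
;; real implementation would parse dates and numbers.
(defn read-stock-prices [filename]
  (with-open [r (io/reader filename)]
    (let [[header & rows] (csv/read-csv r)
          ks (map keyword header)]
      (doall (map #(zipmap ks %) rows)))))
```

The `doall` matters: `csv/read-csv` is lazy, and without realizing the rows inside `with-open`, the file would be closed before the data is read.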

Let's start with a graph that shows how the closing price has changed over the years:

[Figure: the stock's closing price over time, 1995–2001]

So the price started in the low 30s, fluctuated a bit, and finished in the low 20s. During that time, there were some periods when it climbed rapidly. Hopefully, we'll be able to capture and predict those changes.

Merging text and stock features


Before we can start to train the neural network, however, we'll need to figure out how to represent the data and what information the neural network needs to have.

The code for this section will be present in the src/financial/nn.clj file. Open it up and add the following namespace header:

(ns financial.nn
  (:require [clj-time.core :as time]
            [clj-time.coerce :as time-coerce]
            [clojure.java.io :as io]
            [enclog.nnets :as nnets]
            [enclog.training :as training]
            [financial.utils :as u]
            [financial.validate :as v])
  (:import [org.encog.neural.networks PersistBasicNetwork]))

However, we first need to be clear about what we're trying to do. That will allow us to properly format and present the data.

Let's break it down like this: for each document, based on the previous stock prices and the tokens in the document, can we predict the direction of future stock prices?

So one set of features will...

Analyzing both text and stock features together with neural nets


We now have everything ready to perform the analysis, except for the engine that will actually attempt to learn the training data.

In this instance, we're going to try to train an artificial neural network to learn the direction of change of the future prices of the input data. In other words, we'll try to train it to tell whether the price will go up or down in the near future. We want to create a simple binary classifier from the past price changes and the text of an article.

Understanding neural nets

As the name implies, artificial neural networks are machine learning structures modeled on the architecture and behavior of neurons, such as the ones found in the human brain. Artificial neural networks come in many forms, but today we're going to use one of the oldest and most common forms: the three-layer feed-forward network.

We can see the structure of a unit outlined in the following figure:

[Figure: the structure of a single unit in a neural network]

Each unit is able to realize linearly...
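To make this concrete, here is one way such a three-layer feed-forward network might be constructed with enclog; the layer sizes below are placeholders for illustration, not the values we'll actually use:

```clojure
(require '[enclog.nnets :as nnets])

;; A three-layer feed-forward network: an input layer, one hidden
;; layer, and a single output neuron. Sizes are placeholders.
(def network
  (nnets/network (nnets/neural-pattern :feed-forward)
                 :activation :sigmoid
                 :input  100
                 :hidden [30]
                 :output 1))
```

With a sigmoid activation and one output neuron, the network's output falls between 0 and 1, which suits a binary up/down classification.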

Predicting the future


Now is the time to bring together everything that we've assembled over the course of this chapter, so it seems appropriate to start from scratch, using only the Clojure source code that we've written along the way.

We'll take this one block at a time: loading and processing the data, creating training and test sets, training and validating the neural network, and finally viewing and analyzing its results.

Before we do any of this, we'll need to load the proper namespaces into the REPL. We can do that with the following require statement:

user=> (require
         '[me.raynes.fs :as fs]
         '[financial]
         '[financial.types :as t]
         '[financial.nlp :as nlp]
         '[financial.nn :as nn]
         '[financial.oanc :as oanc]
         '[financial.csv-data :as csvd]
         '[financial.utils :as u])

This will give us access to everything that we've implemented so far.

Loading stock prices

First, we'll load the stock prices with the following...

Taking it with a grain of salt


Any analysis like the one presented here comes with a number of caveats that we need to keep in mind, and this chapter is no exception.

Related to this project

The main weakness of this project was that it was carried out on far too little data. This cuts in several ways:

  • We need articles from a number of data sources

  • We need articles from a wider range of time

  • We need more density of articles in the time period

For all of these, there are reasons we didn't address the issues in this chapter. However, if you plan to take this further, you'd need to find some way around them.

There are several ways to look at the results, too. On the day we looked at, the results all clustered close to zero. In fact, this stock is relatively stable, so if the network always indicated little change, it would always have a fairly low SSE (sum of squared errors). Large changes seem to happen only occasionally, and the error from not predicting them has a low impact on the SSE.
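For reference, the SSE mentioned here is just the sum of squared differences between predicted and actual values; a one-function sketch:

```clojure
;; Sum of squared errors between parallel seqs of predictions
;; and actual values.
(defn sse [predicted actual]
  (reduce + (map (fn [p a] (let [e (- p a)] (* e e)))
                 predicted actual)))
```

This is exactly why a low SSE on a stable stock doesn't mean much: a model that always predicts "no change" keeps every error term small.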

Related to machine learning and market modeling...

Summary


Over the course of this chapter, we've gotten hold of some news articles and some stock prices, and we've managed to train a neural network that projects just a little into the future. This would be a risky thing to put into production, but we've also outlined what we'd need to learn to do this correctly.

And this is also the end of this book. Thank you for staying with me this far. You've been a great reader. I hope that you've learned something as we've looked at the 10 data analysis projects that we've covered. If programming and data are both eating this world, hopefully you've seen how to have fun with both.
