In financial portfolios, the returns on their constituent assets depend on a number of factors, such as macroeconomic and microeconomic conditions, and various financial variables. As the number of factors increases, so does the complexity involved in modeling portfolio behavior. Given that computing resources are finite, coupled with time constraints, performing an extra computation for a new factor only increases the bottleneck on portfolio modeling calculations. A linear technique for dimensionality reduction is Principal Component Analysis (PCA). As its name suggests, PCA breaks down the movement of portfolio asset prices into its principal components, or common factors, for further statistical analysis. Common factors that don't explain much of the movement of the portfolio assets receive less weighting in their factors and...
You're reading from Mastering Python for Finance - Second Edition
The Dow Jones industrial average and its 30 components
The Dow Jones Industrial Average (DJIA) is a stock market index that comprises 30 large US companies. Commonly known as the Dow, it is owned by S&P Dow Jones Indices LLC and computed on a price-weighted basis (see https://us.spindices.com/index-family/us-equity/dow-jones-averages for more information on the Dow).
This section involves downloading the datasets of the Dow and its components into pandas DataFrame objects for use in later sections of this chapter.
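To make the term price-weighted concrete, here is a minimal sketch of how such an index level could be computed: the component prices are summed and divided by a divisor. The tickers, prices, divisor, and the price_weighted_index helper are all made-up illustrations, not actual Dow data or any official methodology.

```python
# Hypothetical closing prices for three components (not real market data).
prices = {"AAA": 150.0, "BBB": 50.0, "CCC": 100.0}

# Hypothetical divisor; the real Dow divisor is adjusted over time
# and is not reproduced here.
divisor = 0.5


def price_weighted_index(component_prices, index_divisor):
    """Return the sum of component prices divided by the index divisor."""
    return sum(component_prices.values()) / index_divisor


index_level = price_weighted_index(prices, divisor)

# Each component's weight is its price as a share of the summed prices,
# so higher-priced stocks move the index more.
total_price = sum(prices.values())
weights = {name: price / total_price for name, price in prices.items()}

print(index_level)
print(weights)
```

In this toy example, the highest-priced stock carries half of the index weight, regardless of the size of the company behind it.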
Downloading Dow component datasets from Quandl
The following code retrieves the Dow component datasets from Quandl. The data provider that we will be using is WIKI Prices, a community formed by members...
Applying a kernel PCA
In this section, we will perform kernel PCA to find eigenvectors and eigenvalues so that we can reconstruct the Dow index.
Finding eigenvectors and eigenvalues
We can perform a kernel PCA using the KernelPCA class of the sklearn.decomposition module in Python. The default kernel method is linear. The dataset used in PCA must be normalized, which we can do with z-scoring. The following code does this:
In [ ]:
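from sklearn.decomposition import KernelPCA

# Z-score: subtract the column mean and divide by the column standard deviation
fn_z_score = lambda x: (x - x.mean()) / x.std()
# Normalize every component column of the daily dataset
df_z_components = daily_df_components.apply(fn_z_score)
# Fit a kernel PCA with the default (linear) kernel
fitted_pca = KernelPCA().fit(df_z_components)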
The fn_z_score variable is an inline function to perform...
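For readers without access to the Dow datasets, the same steps can be tried on synthetic data. The following is a minimal sketch, assuming five made-up series that share one common factor; demo_df, demo_z, and demo_pca are hypothetical names standing in for daily_df_components, df_z_components, and fitted_pca. Depending on the installed scikit-learn release, the eigenvalues are exposed as eigenvalues_ or, in older releases, lambdas_, so the sketch checks for both.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import KernelPCA

# Synthetic stand-in for daily_df_components: five hypothetical series that
# share one common factor plus a little idiosyncratic noise.
rng = np.random.default_rng(0)
n_days, n_assets = 250, 5
common_factor = rng.normal(size=(n_days, 1))
betas = np.linspace(0.8, 1.2, n_assets)
noise = 0.3 * rng.normal(size=(n_days, n_assets))
demo_df = pd.DataFrame(
    common_factor * betas + noise,
    columns=[f"asset_{i}" for i in range(n_assets)],
)

# Z-score each column, then fit a kernel PCA with the default linear kernel.
fn_z_score = lambda x: (x - x.mean()) / x.std()
demo_z = demo_df.apply(fn_z_score)
demo_pca = KernelPCA().fit(demo_z)

# Newer scikit-learn releases expose eigenvalues_; older ones used lambdas_.
eigenvalues = getattr(demo_pca, "eigenvalues_", None)
if eigenvalues is None:
    eigenvalues = demo_pca.lambdas_

# Share of the total variance explained by each component.
explained = eigenvalues / eigenvalues.sum()
print(explained[:5].round(3))
```

Because the synthetic series are built around a single common factor, the first component should account for most of the variance here; real component data would typically be less clear-cut.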
Stationary and non-stationary time series
Time series data used for statistical analysis should be stationary so that statistical modeling, such as prediction and forecasting, can be performed correctly. This section introduces the concepts of stationarity and non-stationarity in time series data.
Stationarity and non-stationarity
In empirical time series studies, price movements are observed to drift toward some long-term mean, either upwards or downwards. A stationary time series is one whose statistical properties, such as mean, variance, and autocorrelation, are constant over time. Conversely, observations on non-stationary time series data have their statistical properties...
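The distinction can be illustrated with two synthetic series. This is only a sketch under simple assumptions: white noise stands in for a stationary series, a random walk built from the same shocks stands in for a non-stationary one, and the 100-observation window is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
shocks = rng.normal(size=1000)

# White noise: mean and variance stay the same over time (stationary).
white_noise = pd.Series(shocks)

# Random walk: the cumulative sum of the same shocks (non-stationary).
random_walk = white_noise.cumsum()

# Rolling mean over a 100-observation window (window size is arbitrary).
window = 100
wn_rolling_mean = white_noise.rolling(window).mean().dropna()
rw_rolling_mean = random_walk.rolling(window).mean().dropna()

# How much the rolling mean itself moves around over the sample.
print("white noise:", wn_rolling_mean.std())
print("random walk:", rw_rolling_mean.std())
```

The rolling mean of the white noise series should stay close to zero throughout, while that of the random walk wanders, which is the kind of time-varying behavior that marks a series as non-stationary.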
The Augmented Dickey-Fuller Test
An Augmented Dickey-Fuller (ADF) test is a type of statistical test that determines whether a unit root is present in time series data. Unit roots can cause unpredictable results in time series analysis. The null hypothesis of the test is that a unit root is present, that is, that the time series data is non-stationary. If we fail to reject the null hypothesis, the data is consistent with a non-stationary process. By rejecting the null hypothesis in favor of the alternative hypothesis, we accept the evidence that the time series data is generated by a stationary process. Such a process is also described as trend-stationary when it is stationary around a deterministic trend. Values of the ADF test statistic are typically negative. Lower (more negative) values indicate stronger rejection of the null hypothesis.
Here are some basic autoregression models for use in ADF testing:
- No constant and no trend:
- A...
Analyzing a time series with trends
Let's examine a time series dataset. Take, for example, the prices of gold futures traded on the CME. On Quandl, the gold futures continuous contract is available for download with the following code: CHRIS/CME_GC1. This data is curated by the Wiki Continuous Futures community group, taking into account the front month contracts only. The sixth column of the dataset contains the settlement prices. The following code downloads the dataset from the year 2000 onward:
In [ ]:
import quandl

QUANDL_API_KEY = 'YOUR_QUANDL_API_KEY'  # Your Quandl key here
quandl.ApiConfig.api_key = QUANDL_API_KEY

# Download monthly settlement prices (sixth column) from 2000 onward
df = quandl.get(
    'CHRIS/CME_GC1',
    column_index=6,
    collapse='monthly',
    start_date='2000-01-01')
Examine the head of the dataset using the following command:
In [ ]:
df...
Making a time series stationary
Non-stationary time series data is likely to be affected by a trend or seasonality. Trending time series data has a mean that is not constant over time. Data that is affected by seasonality has variations at specific intervals in time. To make time series data stationary, the trend and seasonality effects have to be removed. Detrending, differencing, and decomposition are three methods for doing so. The resulting stationary data is then suitable for statistical forecasting.
Let's look at all three methods in detail.
Detrending
The process of removing a trend line from non-stationary data is known as detrending. This involves a transformation step that normalizes large values into smaller ones...
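One possible way to detrend a series is sketched below. It assumes a log transform as the normalizing step and a 12-month rolling mean as the trend estimate; both are illustrative choices rather than the only option, and the series is synthetic rather than real market data.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with exponential growth plus noise
# (a stand-in for a trending price series, not real market data).
rng = np.random.default_rng(7)
index = pd.date_range("2000-01-01", periods=120, freq="MS")
trend = np.exp(np.linspace(np.log(300.0), np.log(1500.0), len(index)))
prices = pd.Series(trend * np.exp(0.03 * rng.normal(size=len(index))), index=index)

# Step 1: a log transform compresses large values into smaller ones.
log_prices = np.log(prices)

# Step 2: estimate the trend with a 12-month rolling mean (window is an assumption).
rolling_mean = log_prices.rolling(window=12).mean()

# Step 3: subtract the trend estimate; the first 11 values are NaN and dropped.
detrended = (log_prices - rolling_mean).dropna()

print(detrended.describe())
```

The detrended series should fluctuate within a much narrower range than the log prices. Whether the result is actually stationary is best confirmed with a unit root test such as the ADF test.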
Forecasting and predicting a time series
In the previous section, we identified non-stationarity in time series data and discussed techniques for making time series data stationary. With stationary data, we can proceed to perform statistical modeling such as prediction and forecasting. Prediction involves generating best estimates of in-sample data, while forecasting involves generating best estimates of out-of-sample data. Future values are predicted from previously observed values. One commonly used method for doing so is the Autoregressive Integrated Moving Average.
About the Autoregressive Integrated Moving Average
The Autoregressive Integrated Moving Average (ARIMA) is a forecasting model for stationary time series based on linear...
Summary
In this chapter, we were introduced to PCA as a dimension reduction technique in portfolio modeling. By breaking down the movement of asset prices of a portfolio into its principal components, or common factors, the most useful factors can be kept, and portfolio analysis can be greatly simplified without compromising on computational time and space complexity. In applying PCA to the Dow and its 30 components using the KernelPCA class of the sklearn.decomposition module, we obtained eigenvectors and eigenvalues, which we used to reconstruct the Dow with five components.
In the statistical analysis of time series data, the data is considered either stationary or non-stationary. Stationary time series data is data whose statistical properties are constant over time. Non-stationary time series data has statistical properties that change over time, most likely due...