Exploring Predictive Models for Stock Prices



Introduction

In stock price prediction, there are 2 schools of thought: the Efficient Market Hypothesis (EMH) and the Adaptive Market Hypothesis (AMH). The EMH holds that prices already reflect all available information, so they cannot be predicted consistently. The AMH, by contrast, states that stock prices are predictable and can be profited from. In this project, I want to explore that possibility.

There are many factors that affect the price movements of stocks, and it is difficult to identify all of them. Moreover, it is impractical to capture all of these factors in one model, as the problem would become too large to compute. Thus, in this project I will focus on a few technical indicators as features for my machine learning models.

Objective

The ultimate aim of this project is not to predict stock prices accurately, as that is virtually impossible given my level of expertise. Rather, I will compare various methods and models for predicting stock prices and identify what makes them successful or unsuccessful. In my attempt to predict percentage change in stock prices, I will look at 2 broad categories of models: univariate and multivariate. The univariate model will serve as the benchmark for my project. Ultimately, I hope to design a multivariate model that works better than the univariate model. In doing so, I will have successfully utilised the various technical indicators and developed a simple framework that can be used to predict changes in stock prices more accurately.

Methodology

Due to the complexity and uncertainty of this problem, I have redesigned the methodology multiple times. After much contemplation, I have broken it down into the following steps:

1. Feature Selection

  • Generate technical indicators that will be used as input variables for my model

2. Target Selection

  • Due to the unpredictability of stock price movements, it may or may not be easy to predict percentage changes in stock prices. Feature-target correlation will have to be analysed to determine that.

3. Model Selection

  • Choosing which univariate model to use
  • Choosing which multivariate models to use

4. Model Assessment

  • Train the models and use them to predict results
  • Compare results across all models (RMSE values)

5. Conclusion

  • Outcome and insights generated from this project
  • Future work

Feature Selection

There are 4 major categories of technical indicators: volatility, trend, momentum and volume.

Due to time constraints, I was unable to select the best amongst the many different indicators that are commonly used. However, I decided to use at least 1 indicator from each category for my model. This is to ensure that my project covers a minimum breadth of technical indicators.
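As an illustration of what generating a volatility indicator involves, here is a minimal sketch of the Average True Range, the indicator that appears later in the target analysis. It assumes plain lists of high, low and close prices and a simple moving average over the true range. The function names, the made-up prices and the `window` default are assumptions for this example, and indicator libraries often use a different smoothing (such as Wilder's), so the values may not match the ones used in this project.

```python
def true_range(high, low, prev_close):
    """Largest of: today's range, and the gaps from yesterday's close to today's high/low."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))


def average_true_range(highs, lows, closes, window=14):
    """Simple moving average of the true range (assumed smoothing, see lead-in)."""
    # The true range needs yesterday's close, so it starts from the second day.
    trs = [
        true_range(highs[i], lows[i], closes[i - 1])
        for i in range(1, len(closes))
    ]
    atr = []
    for i in range(window - 1, len(trs)):
        atr.append(sum(trs[i - window + 1 : i + 1]) / window)
    return atr


# Made-up prices, for illustration only.
highs = [10.0, 11.0, 12.0, 13.0]
lows = [9.0, 9.5, 11.0, 12.0]
closes = [9.5, 10.5, 11.5, 12.5]
print(average_true_range(highs, lows, closes, window=2))
```

The output is shorter than the input because the first day has no previous close and the first `window - 1` true ranges do not fill a complete window.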

Aside from these technical indicators, other possible features include the percentage change in price 1, 2 and 3 days ago, as these intuitively seem to carry information about what tomorrow’s percentage change in price will be.

Fundamental indicators were removed from the scope of the project. This is because it is difficult to identify strong fundamental indicators, especially for short-term price predictions.
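The lagged features described above can be sketched as follows. This is only an illustration in plain Python: the function name `lagged_features` and the sample values are hypothetical, and it assumes the percentage changes are already available as a list ordered from oldest to newest.

```python
def lagged_features(pct_changes, n_lags=3):
    """Pair each day's change (the target) with the changes 1..n_lags days earlier."""
    rows = []
    for t in range(n_lags, len(pct_changes)):
        # Only values strictly before day t are used, so no future data leaks in.
        lags = [pct_changes[t - k] for k in range(1, n_lags + 1)]
        rows.append((lags, pct_changes[t]))
    return rows


# Made-up percentage changes, for illustration only.
changes = [0.01, 0.02, 0.03, 0.04, 0.05]
for features, target in lagged_features(changes):
    print(features, "->", target)
```

The first `n_lags` days are dropped because they do not have a full set of earlier values to use as features.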

Target Selection

After consulting seniors, setting the percentage change in stock prices as the target (dependent) variable seemed logical. After all, traders are more interested in the percentage change in a stock’s price than in its absolute value.

Target Variable – Percentage Change in close price

Let:

  • T = today’s close price
  • Y = yesterday’s close price
  • Percentage change = (T-Y)/Y
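The definition above can be checked with a few lines of plain Python. The function name and the prices are made up for illustration. With pandas, `Series.pct_change()` computes the same quantity, though this write-up does not say which tool was used.

```python
def pct_change(closes):
    """Return (T - Y) / Y for each pair of consecutive close prices."""
    changes = []
    for yesterday, today in zip(closes, closes[1:]):
        changes.append((today - yesterday) / yesterday)
    return changes


# Made-up close prices: up 2% on the second day, down 2% on the third.
closes = [100.0, 102.0, 99.96]
print(pct_change(closes))
```

Note that the value is a fraction (0.02 means 2%), and that the first day has no percentage change because there is no previous close.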

However, when I looked for relationships between the features and this target variable, the correlations were weak, suggesting that these features might have negligible influence on the percentage changes in stock prices. The way to resolve this issue is to change either the features or the target variable.
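For reference, a feature-target correlation of this kind can be computed with the Pearson coefficient. The sketch below is a plain-Python illustration with made-up numbers. Whether Pearson was the measure used in this project is an assumption, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up feature and target values, for illustration only.
feature = [1.0, 2.0, 3.0, 4.0]
target = [2.1, 3.9, 6.2, 7.8]
print(round(pearson(feature, target), 3))
```

Pearson correlation only measures linear association, so a value near 0 does not rule out a non-linear relationship between a feature and the target.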



Target Variable – Open Price

Unlike percentage change, which is a relative value, open price is an absolute value. The pairplots below show stronger correlations. In particular, the Average True Range indicator seems to have a linear relationship with the open price, while the Accumulation/Distribution Index seems to have a polynomial relationship with it.







Comparison

While it seems as though open price is a better target variable to predict than percentage change in price, this is not conclusive. While the technical indicators might each have low correlation with percentage change in price, when used together they might still have a strong correlation with it.

On further analysis, this might be possible. The technical indicators have relatively low correlation with one another, which suggests that each carries different information. This increases the potential for them, when used together, to have a strong correlation with percentage change in price.



Model Selection

The univariate model that I have chosen is the ARIMA model, which is highly recommended by the online community. As for the multivariate models, I have chosen Random Forest Regression and Multiple Linear Regression. While SVM was recommended in this research paper, my attempts at using it showed that it was not efficient for this project: training took too long given the time and computational resources available, so I did not use SVM in my analyses.
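To make the multivariate idea concrete, here is a minimal sketch of Multiple Linear Regression fitted by ordinary least squares with NumPy. The function names and the toy data are assumptions for illustration, not the code or data used in this project, which may well have relied on a library implementation instead.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Return [intercept, w1, w2, ...] minimising the squared prediction error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient acts as the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply the fitted intercept and weights to new feature rows."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]


# Toy data generated from y = 1 + 2*x1 + 3*x2, for illustration only.
X = [[1, 2], [2, 1], [3, 4], [4, 3]]
y = [9, 8, 19, 18]
coef = fit_linear_regression(X, y)
print(coef)
print(predict(coef, [[5, 5]]))
```

Each row of `X` would hold the feature values for one day (for example, one value per technical indicator), and `y` the corresponding target.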

Model Assessment

I carried out an 80-20 split on the entirety of my data set. The first 80% is used as my training set, on which I carry out a cross-validation analysis. The last 20% is used as a test set.
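A chronological split like this one can be sketched in a couple of lines. The function name is hypothetical. The key point is that the data is cut at a single position rather than shuffled, so the test set always lies after the training set in time.

```python
def chronological_split(series, train_fraction=0.8):
    """Split an ordered series into an earlier training part and a later test part."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Ten made-up observations, ordered from oldest to newest.
observations = list(range(10))
train, test = chronological_split(observations)
print(train)
print(test)
```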

Cross Validation

Set-up

The first phase of model assessment was cross-validation. Since this is a time-series problem, I used a time-series cross-validation method instead of the usual K-fold cross-validation. This is because the data points are not independent of one another, so a sequential train-validation split must be used to keep each validation window after the data the model is trained on.



Another issue was the basis of comparison. The ARIMA model is more effective for short-term predictions. If the test set in the validation phase were too large, the model would have to make predictions very far ahead of time, which would disadvantage the ARIMA model. In light of this, I reduced the test sets to a window of 5 points into the future. For every iteration in the cross-validation phase, the trained model only has to predict 5 timesteps into the future.
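The set-up described above can be sketched as an expanding-window splitter: each split trains on everything up to some point and validates on the next 5 points. The function below is a plain-Python illustration with hypothetical names. It is similar in spirit to scikit-learn's `TimeSeriesSplit` with a fixed `test_size`, but whether that class was used in this project is an assumption. The default of 10 splits matches the number of splits reported in the results.

```python
def expanding_window_splits(n_samples, n_splits=10, test_size=5):
    """Yield (train_indices, test_indices); each test window directly follows its training data."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        start = first_test_start + i * test_size
        train_idx = list(range(0, start))
        test_idx = list(range(start, start + test_size))
        yield train_idx, test_idx


# 100 made-up time steps: the last 50 are covered by ten 5-step validation windows.
for train_idx, test_idx in expanding_window_splits(100):
    print(len(train_idx), test_idx)
```

Because the training indices always precede the validation indices, no model is ever evaluated on data older than what it was trained on.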

Results analysis

My cross-validation has 10 splits, resulting in a generation of 10 RMSE values. I then compared the RMSE values of each of the 3 models.
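For reference, the RMSE of one split is the square root of the mean squared difference between true and predicted values. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# One made-up 5-step validation window, for illustration only.
actual = [10.0, 10.2, 10.1, 10.4, 10.3]
predicted = [10.1, 10.1, 10.2, 10.3, 10.4]
print(rmse(actual, predicted))
```

RMSE is expressed in the same units as the target, and a lower value means the predictions are closer to the true values. Computing it once per split gives the 10 values per model that are compared here.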





Test Set

In this section, I look at the models’ performance on the test set. All 3 models are trained on the 80% training set set aside at the start.

The ARIMA model is given an advantage because it is only suited to short-term prediction: it is retrained on all data available up to each point before predicting. This can be explained as follows:

Let:

  • Ya = Open price used as the initial training data (the first 80% of values as initialized)
  • Yb = Test set with data points (y1, y2 … yk).

The ARIMA model is first trained on Ya and then predicts y1. It is then re-initialised and trains on [Ya + y1] before predicting y2. This way, the ARIMA model only has to predict one timestep into the future.
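The procedure above is often called walk-forward validation, and it can be sketched independently of the forecasting model. In the illustration below, `forecast_next` is a hypothetical callable that is given the full history so far and returns a one-step-ahead prediction. For ARIMA it would refit the model on that history each time (for instance with the `ARIMA` class in statsmodels, which is an assumption about tooling). A naive "repeat the last value" forecaster stands in here so that the example runs without extra dependencies.

```python
def walk_forward(train, test, forecast_next):
    """Predict each test point one step ahead, growing the history after every step."""
    history = list(train)            # Ya: the initial training data
    predictions = []
    for actual in test:              # Yb: y1, y2, ..., yk
        # Predict the next point using only the data seen so far.
        predictions.append(forecast_next(history))
        # Only afterwards is the true value added, ready for the next refit.
        history.append(actual)
    return predictions


def last_value(history):
    """Naive stand-in forecaster (not ARIMA): tomorrow equals today."""
    return history[-1]


# Made-up open prices, for illustration only.
ya = [1.0, 2.0, 3.0]
yb = [4.0, 5.0, 6.0]
print(walk_forward(ya, yb, last_value))
```

Every prediction is made before the corresponding true value is revealed, so the model never sees the point it is asked to predict.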

Results analysis




The predicted values are plotted as a yellow line and the true values as a blue line. Both ARIMA (left-most) and Multiple Linear Regression (right-most) seem to follow the trend of the true values quite closely.

The first observation is that yet again, the multivariate Linear Regression Model outperforms the ARIMA model, having a lower RMSE score of 0.0761 as compared to 0.101.

The second observation is that while the models seem to follow the trend of the true values quite closely, they are not accurate enough. As seen in the predicted values of the ARIMA model in the table below, the true and predicted values differ by up to 1 dollar. It is good that the models managed to generate the right trend. However, more fine-tuning has to be done in order for them to predict stock prices with precision.



Conclusion

Outcome of the project

As mentioned at the start, the project did not aim to develop a machine learning model that precisely predicts stock prices. Rather, in this project I achieved 3 main things:

  1. Designed a methodology for stock price analysis
  2. Compared the effectiveness of different models
  3. Proved the potential of technical analysis

This project serves as a foundation for stock price analysis. In order to predict stock prices accurately, a lot more experimentation, research and analysis has to be done.

Future work

While this project has achieved its goal and proved successful, it has opened many doors for future work to fully harness the potential of technical analysis in stock price prediction. Below is a list of work that I did not have the time to carry out but that would be useful in enhancing the analyses done in this project:

  1. Hyperparameter tuning
  2. Feature engineering: create or use more features instead of just 4 technical indicators and previous days’ prices
  3. Fundamental analysis: fundamental indicators might have a strong impact on longer-term predictions

References

Smola, Alex J., and Bernhard Schölkopf. “A Tutorial on Support Vector Regression.” Statistics and Computing, vol. 14, no. 3, 2004, pp. 199–222, doi:10.1023/b:stco.0000035301.49549.88.

Beyaz, Erhan, et al. “Comparing Technical and Fundamental Indicators in Stock Price Forecasting.” 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2018, doi:10.1109/hpcc/smartcity/dss.2018.00262.

Behl, Smarth, et al. “A Machine Learning Based Stock Trading Framework Using Technical and Economic Analysis.”

Shah, Dev, et al. “Stock Market Analysis: A Review and Taxonomy of Prediction Techniques.” International Journal of Financial Studies, vol. 7, no. 2, 2019, p. 26, doi:10.3390/ijfs7020026.

Heo, Junyoung, and Jin Yong Yang. “Stock Price Prediction Based on Financial Statements Using SVM.” International Journal of Hybrid Information Technology, vol. 9, no. 2, 2016, pp. 57–66, doi:10.14257/ijhit.2016.9.2.05.

Shen, Shurong, et al. “Stock Market Forecasting Using Machine Learning Algorithms.”