In stock price prediction, there are two schools of thought: the Efficient Market Hypothesis (EMH) and the Adaptive Market Hypothesis (AMH). AMH states that stock prices are predictable and can be profited from. In this project, I want to explore that possibility.
Many factors affect the price movements of stocks, and identifying all of them is difficult. Moreover, encapsulating all of these factors in one model is impractical, as the problem would become too large to compute. Thus, in this project I will focus on a few technical indicators as features for my machine learning models.
The ultimate aim of this project is not to predict stock prices accurately, as that is virtually impossible given my level of expertise. Rather, I will compare various methods and models for predicting stock prices and identify what makes them successful or unsuccessful. In my attempt to predict the percentage change in stock prices, I will look at 2 broad categories of models: univariate and multivariate. The univariate model will serve as the benchmark for my project. Ultimately, I hope to design a multivariate model that works better than the univariate model. In doing so, I will have successfully utilised the various technical indicators and developed a simple framework that can be used to predict changes in stock prices more accurately.
Due to the complexity and uncertainty of this problem, I redesigned the methodology multiple times. After much contemplation, I broke it down into the following steps:
1. Feature Selection
2. Target selection
3. Model Selection
4. Model assessment
5. Conclusion
There are 4 major categories of technical indicators: volatility, trend, momentum and volume.
Due to time constraints, I was unable to select the best amongst the many indicators that are commonly used. However, I decided to use at least 1 indicator from each category for my model. This ensures that my project covers a minimum breadth of technical indicators.
Aside from these technical indicators, other possible features include the percentage change in price 1, 2 and 3 days ago, as these intuitively seem to carry information about what tomorrow’s percentage change in price will be.
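The lagged features described above can be sketched in a few lines. This is only an illustration using the standard library: the prices are made up, the helper names `pct_change` and `lagged_rows` are hypothetical, and the choice of which price (open, close, etc.) the change is computed from is an assumption left to the reader.

```python
def pct_change(prices):
    """Percentage change between consecutive prices; the first day has no predecessor."""
    return [None] + [
        (curr - prev) / prev * 100.0 for prev, curr in zip(prices, prices[1:])
    ]


def lagged_rows(changes, n_lags=3):
    """Pair each day's change (the target) with the changes 1..n_lags days earlier."""
    rows = []
    for t in range(len(changes)):
        lags = [changes[t - k] if t - k >= 0 else None for k in range(1, n_lags + 1)]
        if changes[t] is None or any(lag is None for lag in lags):
            continue  # skip days that do not have a full lag history
        rows.append({"lags": lags, "target": changes[t]})
    return rows


# Made-up prices, purely for illustration.
prices = [100.0, 102.0, 101.0, 103.0, 104.0, 102.0]
changes = pct_change(prices)
rows = lagged_rows(changes)

for row in rows:
    print(row)
```

Each row holds the changes from 1, 2 and 3 days earlier (in that order) next to the change they are meant to help predict, so the first few days of the series are dropped.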
Fundamental indicators were removed from the scope of the project because it is difficult to identify strong fundamental indicators, especially for short-term price predictions.
After consulting seniors, setting the percentage change in stock prices as the target (dependent) variable seemed logical because, after all, traders are more interested in the percentage change in a stock's price than in its absolute value.
Let:
However, when I looked for relationships between the features and this target variable, the correlation was weak, suggesting that these features might have negligible influence on the percentage change in stock prices. The way to resolve this issue is to either change the features or change the target variable.
Unlike percentage change, which is a relative value, open price is an absolute value. The pairplots below show stronger correlations. In particular, the Average True Range indicator seems to have a linear relationship with the open price, while the accumulation distribution index seems to have a polynomial relationship with it.
While open price appears to be a better target variable than percentage change in price, this is not conclusive. Although each technical indicator might have a low correlation with percentage change in price on its own, the indicators might still correlate strongly with percentage change in price when used together.
On further analysis, this seems possible. The technical indicators have relatively low correlation with one another, which means they are not simply repeating the same information. This increases the chance that, when used together, they can have a strong correlation with percentage change in price.
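A pairwise correlation check of this kind can be sketched as follows. This is a minimal standard-library illustration rather than the analysis actually run for the project: the indicator values are made up, and the column names `atr`, `adi` and `momentum_indicator` as well as the helpers `pearson` and `correlation_matrix` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlation of every column with every other column, as nested dicts."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up indicator values, purely for illustration.
features = {
    "atr": [1.0, 1.2, 1.1, 1.4, 1.3],
    "adi": [10.0, 9.0, 12.0, 11.0, 14.0],
    "momentum_indicator": [55.0, 60.0, 48.0, 52.0, 58.0],
}
matrix = correlation_matrix(features)

for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Off-diagonal values close to zero would indicate that two indicators carry largely different information, which is the property the argument above relies on.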
The univariate model that I have chosen is the ARIMA model, which is highly recommended by the online community. For the multivariate models, I have chosen Random Forest Regression and Multiple Linear Regression. While SVM was recommended in this research paper, my attempts at using it showed that it was not efficient for this project: training took too long, and given the limited time and computational resources available, I did not use SVM in my analyses.
I carried out an 80-20 split on the entirety of my data set. The first 80% is used as my training set and is where I carry out the cross-validation analysis. The last 20% is used as the test set.
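Because the data is a time series, the split is by position rather than by random sampling. A minimal sketch, assuming the observations are already sorted from oldest to newest (the helper name `chronological_split` and the placeholder data are hypothetical):

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows so that every training row precedes every test row."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Placeholder data: 100 time-ordered observations, oldest first.
observations = list(range(100))
train, test = chronological_split(observations)

print(len(train), len(test))
```

No shuffling is involved, so the test set always consists of the most recent observations.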
The first phase of model assessment took place during cross-validation. Since this is a time-series problem, I used a time-series cross-validation method instead of the usual K-fold cross-validation. This is because the data points are not independent of one another, so a sequential train-validation split must be used.
Another issue was the basis of comparison. The ARIMA model is more effective for short-term predictions. If the validation set in each split were too large, the model would have to make predictions very far ahead of time, which would put the ARIMA model at a disadvantage. In light of this, I reduced the validation sets to a window of 5 points into the future. For every iteration in the cross-validation phase, the trained model only has to predict 5 timesteps ahead.
My cross-validation has 10 splits, resulting in a generation of 10 RMSE values. I then compared the RMSE values of each of the 3 models.
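The splitting scheme described above (10 sequential splits, each validated on the next 5 points) can be sketched as below. This is an assumption-laden illustration, not the project's actual code: `expanding_window_splits` and `rmse` are hypothetical helpers, the series is made up, and the "model" is a placeholder that predicts the training mean. The split logic is intended to resemble what scikit-learn's `TimeSeriesSplit` produces when given `n_splits=10` and `test_size=5`.

```python
import math


def expanding_window_splits(n_samples, n_splits=10, test_size=5):
    """Yield (train_indices, validation_indices) pairs in time order.

    Each validation window holds the test_size points that directly follow
    its training data, and the training window grows with every split.
    """
    first_start = n_samples - n_splits * test_size
    if first_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        start = first_start + i * test_size
        yield list(range(start)), list(range(start, start + test_size))


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up series, purely for illustration.
series = [float(i % 7) for i in range(120)]

scores = []
for train_idx, val_idx in expanding_window_splits(len(series)):
    train = [series[i] for i in train_idx]
    # Placeholder model: forecast the training mean for all 5 future steps.
    forecast = [sum(train) / len(train)] * len(val_idx)
    scores.append(rmse([series[i] for i in val_idx], forecast))

print(len(scores), "RMSE values:", [round(s, 3) for s in scores])
```

Swapping the placeholder forecast for each real model gives one list of 10 RMSE values per model, which can then be compared split by split.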
In this section, I look at the models’ performance on the test set. All 3 models are trained on the 80% training set set aside at the start.
The ARIMA model is given an advantage because, by nature, it can only predict well in the short term. It is retrained on all the data available so far before each prediction. A clearer way of explaining this is as follows:
Let Ya be the training set, and let y1, y2, ... be the successive values in the test set.
The ARIMA model is first trained on Ya and then predicts y1. It is then re-initialised and trains on [Ya + y1] before predicting y2. This way, the ARIMA model only has to predict one timestep into the future.
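This retrain-then-predict loop is often called walk-forward validation. A minimal sketch is given below, under two assumptions: the true value (not the prediction) is appended to the history after each step, and `naive_forecast` is only a stand-in for refitting and forecasting with ARIMA, which is not implemented here. The numbers are made up.

```python
def naive_forecast(history):
    """Stand-in for a refitted model: predict that the next value equals the last one."""
    return history[-1]


def walk_forward(train, test, fit_and_predict=naive_forecast):
    """Predict each test point one step ahead, refitting on all data seen so far."""
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(fit_and_predict(history))  # fit on history, predict one step
        history.append(actual)  # reveal the true value before the next step
    return predictions


# Made-up values, purely for illustration.
Ya = [10.0, 10.5, 10.2, 10.8]
test_values = [11.0, 10.9, 11.3]
predictions = walk_forward(Ya, test_values)

print(predictions)
```

Passing a function that fits an ARIMA model on `history` and returns its one-step forecast as `fit_and_predict` would reproduce the procedure described above.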
The predicted values are plotted as a yellow line and the true values as a blue line. Both ARIMA (left-most) and Multiple Linear Regression (right-most) seem to follow the trend of the true values quite closely.
The first observation is that yet again, the multivariate Linear Regression Model outperforms the ARIMA model, having a lower RMSE score of 0.0761 as compared to 0.101.
The second observation is that while the models seem to follow the trend of the true values quite closely, they are not accurate enough. As seen in the predicted values of the ARIMA model in the table below, the true and predicted values differ by up to 1 dollar. It is good that the models managed to generate the right trend. However, more fine-tuning has to be done in order for them to predict stock prices with precision.
As mentioned at the start, the project was not aimed at developing a machine learning model that precisely predicts stock prices. Rather, in this project I achieved 3 main things:
This project serves as a comprehensive introduction to stock price analysis. In order to accurately predict stock prices, a lot more experimentation, research and analysis has to be done.
While this project has achieved its goal and proved to be successful, it has opened up many doors for future work to fully harness the potential of technical analysis in stock price prediction. Below is a list of work I did not have the time to carry out, but which would be useful in enhancing the analyses done in this project:
Smola, Alex J., and Bernhard Schölkopf. “A Tutorial on Support Vector Regression.” Statistics and Computing, vol. 14, no. 3, 2004, pp. 199–222., doi:10.1023/b:stco.0000035301.49549.88.
Beyaz, Erhan, et al. “Comparing Technical and Fundamental Indicators in Stock Price Forecasting.” 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2018, doi:10.1109/hpcc/smartcity/dss.2018.00262.
Behl, Smarth, et al. “A Machine Learning Based Stock Trading Framework Using Technical and Economic Analysis.”
Shah, Dev, et al. “Stock Market Analysis: A Review and Taxonomy of Prediction Techniques.” International Journal of Financial Studies, vol. 7, no. 2, 2019, p. 26., doi:10.3390/ijfs7020026.
Heo, Junyoung, and Jin Yong Yang. “Stock Price Prediction Based on Financial Statements Using SVM.” International Journal of Hybrid Information Technology, vol. 9, no. 2, 2016, pp. 57–66., doi:10.14257/ijhit.2016.9.2.05.
Shen, Shurong, et al. “Stock Market Forecasting Using Machine Learning Algorithms.”
Fall 2019
Shane Lim