Long Short Term Memory Using Stochastic Gradient Descent and Adam for Stock Prediction

The stock market is a place where stocks are bought and sold, and investors expect a profitable difference between the buying and selling prices. Stock prices can be predicted in various ways, one of which is by using deep learning models. Long Short Term Memory (LSTM) is a method that can be used to predict time series data. It is a development of the Recurrent Neural Network (RNN), which makes it more complicated and more powerful. Training an LSTM model requires an optimization method to minimize errors. Many optimization methods are available; this research uses Stochastic Gradient Descent (SGD) and Adam. The experiments use learning rates of 0.01, 0.001, and 0.0001 and epoch counts of 25, 50, 100, 200, 400, and 1000. The research data are the stock prices of BBRI, BBNI, BMRI, and BBTN. This study also tries to predict stock prices on the next day using five historical stock prices; both LSTM SGD and LSTM Adam succeeded in predicting the next day.


INTRODUCTION
Investors can decide when to buy and sell their shares to achieve high returns with the help of prediction tools [1]. Many methods can be used for prediction, such as ARIMA [2], ordinary differential equations [3], artificial neural networks (ANN) [4], and recurrent neural networks (RNN) [5].
An artificial neural network (ANN) is an artificial intelligence method inspired by networks of neurons [6]. The ANN architecture has three parts: an input layer, an output layer, and a hidden layer that connects the input layer to the output layer. The input data are multiplied by the weights and added to the bias, and the result is then passed through an activation function to obtain the output of the ANN [7]. The recurrent neural network (RNN) is a development of the ANN that has been specially designed to process time series data [8], [9]. Data that enter the RNN architecture are processed repeatedly.
Training an RNN suffers from the vanishing gradient problem [10]. A vanishing gradient occurs when the gradient value is so small that it has no effect when updating the weights in the backpropagation process [11]. The gradient shrinks because of the repeated multiplication through the hidden layer of the RNN during backpropagation [11], [12]. Long Short Term Memory (LSTM) is a method proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997 [10]. This method is a development of the RNN. LSTM was developed to prevent the vanishing gradient problem that arises when training an RNN, so LSTM can process longer time series data than an RNN. Four additional components make up the architecture of the LSTM, namely the forget gate, input gate, output gate, and cell state. LSTM can predict stocks with better performance and smaller errors [13].
Optimization methods are used in ANN architectures, both traditional ones and developed architectures such as LSTM. The goal of an optimization method is to improve the weight and bias values by minimizing the error of the architecture used. In this study, SGD and Adam are applied to the LSTM. The concept of SGD is to minimize the error value by calculating the gradient from one or several data points and using it to update the parameters (weights and biases) [14]. Adam was introduced in [15] in 2015 and is now widely used in various studies.
This study aims to compare the performance of LSTM SGD and LSTM Adam using various learning rates and numbers of epochs in training. After obtaining the smallest error value from each method, the last step is to make predictions using historical data.
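The repeated multiplication that causes the vanishing gradient can be illustrated with a small sketch. It assumes a scalar RNN of the form h_t = tanh(w * h_{t-1} + u * x); the function name and the values of `w`, `u`, and `x` are hypothetical and chosen only for illustration, not taken from this study.

```python
import math


def gradient_through_time(steps, w=0.5, u=1.0, x=0.5):
    """Return d h_T / d h_0 for a scalar RNN h_t = tanh(w * h_{t-1} + u * x).

    The weights w, u and the constant input x are illustrative assumptions.
    """
    h = 0.0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h + u * x)
        # Chain rule for one step: d h_t / d h_{t-1} = w * (1 - h_t ** 2).
        # Each factor is smaller than |w| < 1, so the product keeps shrinking.
        grad *= w * (1.0 - h * h)
    return grad


for steps in (5, 20, 50):
    print(steps, gradient_through_time(steps))
```

The printed gradient shrinks rapidly as the number of time steps grows, which is the effect that the LSTM cell state is designed to counteract.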

Normalization
In research, data with different scales are often encountered [16]. For example, one value may be 100 while another is 10000. Using unscaled data can make model training take longer because of the large values involved and, in some cases, can lead to an ineffective model [17]. This study uses a popular normalization method, namely the Min-Max Scaler.
The Min-Max Scaler uses the maximum and minimum values in a data set to generate new values. Data to which the Min-Max Scaler has been applied lie in the range 0 to 1 [18]. The equation of the Min-Max Scaler is:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x$ is an original value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum values in the data set, and $x'$ is the normalized value. Normalization only changes the scale of the data but does not change the characteristics of the data. Figure 1 shows that the characteristics of the data do not change, while the range of the data becomes 0 to 1 after applying the Min-Max Scaler.
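The normalization described above can be sketched in a few lines of plain Python. The function names and the sample prices are hypothetical; the inverse function is included on the assumption that predictions need to be mapped back to the price scale before they are reported.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: avoid dividing by zero and map them to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def min_max_inverse(scaled, lo, hi):
    """Map scaled values back to the original scale defined by lo and hi."""
    return [s * (hi - lo) + lo for s in scaled]


prices = [4000.0, 4100.0, 4050.0, 4200.0, 4150.0]  # hypothetical closing prices
scaled = min_max_scale(prices)
print(scaled)
print(min_max_inverse(scaled, min(prices), max(prices)))
```

The order and relative spacing of the values are preserved, which matches the observation that normalization changes only the scale of the data.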

Long Short Term Memory (LSTM)
Sepp Hochreiter and Jürgen Schmidhuber introduced the LSTM method to the public in 1997. LSTM is a type of RNN that performs better in practice because of its updated architecture and the dynamics of backpropagation in LSTM [19]. The main idea of LSTM is to add a cell state and other components [20]. LSTM consists of an input gate, a forget gate, and an output gate that can adjust the cell state [21].

Figure 2. Illustration of LSTM Architecture
The process of each component in LSTM has been explained by Nurjaman in [22] and Lipton in [23]. The LSTM process starts at the forget gate. The incoming value is converted to the range 0 to 1 with a sigmoid activation function [24].
The input gate creates a new value that will be passed to the cell state. As Figure 4 shows, two processes occur at the input gate: one value passes through the sigmoid activation function, and another passes through the tanh activation function and is called the candidate cell (c̃).
The cell state is the most important component in the LSTM because it connects each LSTM cell directly [24], [25]. The values of the forget gate and input gate are used to update the old cell state to the new one [26].
The symbol ⊗ indicates that the multiplication used is the element-wise product.
After obtaining the cell state and output gate values, the last step is to calculate the hidden value, which becomes the output of the LSTM.

Activation Function
The activation function is a function applied to the weighted inputs and biases [11]. Activation functions are divided into linear and nonlinear activation functions [27]. This study uses two nonlinear activation functions: sigmoid and tanh. The sigmoid activation function maps the incoming input to the range 0 to 1 [11]. The sigmoid activation function equation is:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The tanh activation function maps the input to the range -1 to 1 [11]:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
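The two activation functions can be written directly with the standard library. This is a minimal sketch; splitting the sigmoid into two branches is an implementation choice made here to avoid overflow for large negative inputs, not something prescribed by the study.

```python
import math


def sigmoid(x):
    """Map any real input to the range 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use an equivalent form so math.exp never overflows.
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x):
    """Map any real input to the range -1 to 1."""
    return math.tanh(x)


for value in (-5.0, 0.0, 5.0):
    print(value, sigmoid(value), tanh(value))
```

The two functions are related by tanh(x) = 2 * sigmoid(2x) - 1, so tanh can be seen as a rescaled sigmoid centered at zero.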

Forward Pass LSTM
The forward pass is the process of computing the predicted data and the resulting error. The forward pass of LSTM is more complicated than that of RNN because there are several additional components. The input data pass through the forget gate, input gate, cell state, and output gate, and the output of the LSTM is obtained. The forward pass of LSTM consists of the following steps:
1. Calculate the forget gate value using equation (2).
2. Calculate the input gate and candidate cell values using equations (3) and (4).
3. Calculate the cell state value using equation (5).
4. Calculate the output gate value using equation (6).
5. Calculate the hidden value that becomes the output using equation (7).
6. Finally, use equation (10) to calculate the loss.
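The forward pass steps above can be sketched for a scalar LSTM cell (one input feature, one hidden unit). The gate formulas are the commonly used LSTM form and are assumed here, since the exact notation of equations (2) to (7) is not reproduced in this text. The parameter names (`w_*` for input weights, `u_*` for recurrent weights, `b_*` for biases), the parameter values, and the squared-error loss are likewise assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One forward step of a scalar LSTM cell (assumed standard formulation)."""
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])          # forget gate
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])          # input gate
    c_tilde = math.tanh(p["w_c"] * x + p["u_c"] * h_prev + p["b_c"])  # candidate cell
    c = f * c_prev + i * c_tilde                                      # new cell state
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])          # output gate
    h = o * math.tanh(c)                                              # hidden value
    return h, c


def lstm_forward(window, p):
    """Run the cell over a window of inputs and return the last hidden value."""
    h, c = 0.0, 0.0
    for x in window:
        h, c = lstm_step(x, h, c, p)
    return h


names = [kind + "_" + gate for kind in ("w", "u", "b") for gate in ("f", "i", "c", "o")]
params = {name: 0.5 for name in names}    # hypothetical parameter values
window = [0.10, 0.25, 0.20, 0.40, 0.35]   # five normalized prices (made up)
target = 0.45                             # hypothetical next-day normalized price

prediction = lstm_forward(window, params)
loss = (target - prediction) ** 2         # squared error, an assumed loss
print(prediction, loss)
```

In practice the cell would use vectors and weight matrices, with the element-wise product taking the place of the scalar multiplications shown here.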

Backward Pass LSTM
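The backward pass is the process of going backward through the forward pass. Its purpose is to minimize errors by updating the weights of the architecture so that the learning process improves. The backward pass in LSTM has been explained in the studies of Greff [20] and Maohua [28]. The steps are as follows:
1. Calculating the loss derivatives with respect to the input weights of the forget gate, input gate, candidate cell (c̃), and output gate at t = T, using equations (11) to (14).
2. Calculating the loss derivatives with respect to the input weights of the forget gate, input gate, candidate cell (c̃), and output gate at t < T, using equations (15) to (18).
3. Calculating the loss derivatives with respect to the recurrent weights and biases of the forget gate, input gate, candidate cell (c̃), and output gate in the same way as equations (11) to (18), adjusting the variables with respect to which the derivative is taken.
4. Adding up the loss derivative values at t = T and t < T.
5. Updating the input weights, recurrent weights, and biases using the SGD and Adam equations.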

Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a technique for solving optimization problems in deep learning. The problem is to find the optimal weights by minimizing the resulting error. SGD is a development of Gradient Descent (GD) in which only one gradient $\nabla f_i$ $(i = 1, 2, \ldots, n)$ is used at each update, chosen at random (stochastically) [29].
The algorithm starts by choosing a data point at random. Errors from the first step are corrected over repeated steps by following the gradient of the function to be minimized.
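The idea of updating the parameters from one randomly chosen data point at a time can be sketched on a much simpler model than LSTM. The example below fits a line y = w * x + b with SGD; the model, the data, the learning rate of 0.1, and the function name are all illustrative assumptions and not the setup used in this study.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.1, epochs=200, seed=0):
    """Fit y = w * x + b with SGD, updating after every single sample."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)          # visit the samples in random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of the squared error error ** 2 for this one sample.
            grad_w = 2.0 * error * xs[i]
            grad_b = 2.0 * error
            # SGD update: step against the gradient, scaled by the learning rate.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(11)]
ys = [2.0 * x + 1.0 for x in xs]      # noise-free line with slope 2, intercept 1
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

The same update rule applies to the LSTM weights and biases; only the way the gradient is computed changes.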

Adam
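Adam is a method that was introduced in 2015 in [15]. Just like SGD, Adam is used to update the weights so that the architecture learns better because the error is minimized. The steps in Adam's algorithm are as follows:
1. Initializing $m_0 = 0$ (first moment).
2. Initializing $v_0 = 0$ (second moment).
3. Calculating the gradient $g_t$.
4. Updating the first moment with the equation: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
5. Updating the second moment with the equation: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
6. Correcting the bias of the first moment with the equation: $\hat{m}_t = m_t / (1 - \beta_1^t)$
7. Correcting the bias of the second moment with the equation: $\hat{v}_t = v_t / (1 - \beta_2^t)$
8. Updating the weights with the equation: $w_t = w_{t-1} - \alpha \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
There are several parameters in Adam: $g_t$ is the gradient value, $\alpha$ is the learning rate, and $\beta_1$ and $\beta_2$ have the values 0.9 and 0.99. The last parameter, $\epsilon$, has the value $10^{-8}$ and prevents division by 0.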
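The Adam steps can be sketched for a single scalar weight. The update formulas are the standard Adam equations; the β values and ε follow the values stated in the text, while the objective (w - 3)^2, the learning rate of 0.01, and the function names are hypothetical choices for illustration.

```python
import math


def adam_minimize(grad_fn, w, lr=0.01, beta1=0.9, beta2=0.99, eps=1e-8, steps=2000):
    """Minimize a scalar function with Adam, given its gradient function."""
    m, v = 0.0, 0.0                                # first and second moments
    for t in range(1, steps + 1):
        g = grad_fn(w)                             # gradient at the current weight
        m = beta1 * m + (1 - beta1) * g            # update the first moment
        v = beta2 * v + (1 - beta2) * g * g        # update the second moment
        m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def grad(w):
    """Gradient of the illustrative objective (w - 3) ** 2."""
    return 2.0 * (w - 3.0)


first = adam_minimize(grad, 0.0, steps=1)
final = adam_minimize(grad, 0.0)
print(first, final)
```

After bias correction, the very first update has a size close to the learning rate regardless of how large the gradient is, which is one reason Adam is less sensitive to gradient scale than plain SGD.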

Root Mean Squared Error (RMSE)
Model evaluation is needed to find out how good the results obtained from the model are [30]. The prediction error is the difference between the predicted data and the actual data. There are many methods to measure this error, one of which is the Root Mean Squared Error (RMSE). The closer the RMSE is to zero, the smaller the error and the better the model [22]. RMSE is defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$
where $y_i$ is the actual value, $n$ is the number of data points, and $\hat{y}_i$ is the predicted value obtained from the LSTM forward pass process.
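The RMSE calculation can be sketched as follows. The function name and the sample values are hypothetical and are not results from this study.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [4000.0, 4100.0, 4050.0]       # hypothetical actual prices
predicted = [4010.0, 4090.0, 4060.0]    # hypothetical predicted prices
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, RMSE is expressed in the same unit as the data and penalizes large deviations more heavily than small ones.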

RESULTS AND DISCUSSION
The data are the closing prices of BBRI, BBNI, BMRI, and BBTN shares, downloaded from Yahoo! Finance for the period from January 2, 2020, to October 7, 2022. The normalization process is applied to each stock, the normalized data are then arranged into a time series pattern, and this pattern becomes the input of LSTM SGD and LSTM Adam. The LSTM prediction process passes through the forget gate, input gate, cell state, and output gate using equations (2) to (7). This process repeats along the time series pattern, and the LSTM produces the loss value in equation (10). Next, to bring the output closer to the actual value, LSTM uses the backward process. The aim of this process is to find the optimal weights of the LSTM. LSTM has four components, so the backward calculation is carried out for each component. Because LSTM uses time series data, the backward process starts from the latest time step. The equations used at t = T are equations (11) to (14), and the equations used at t < T are equations (15) to (18). The weights and biases are updated using equation (19) for SGD and equation (24) for Adam. The iterations continue until the weights are considered optimal, that is, when the error no longer changes significantly between iterations.
Experiments with various learning rates and numbers of epochs found that LSTM SGD achieves its minimum average error with a learning rate of 0.01 and LSTM Adam with a learning rate of 0.001. Table 2 shows the error values of LSTM SGD with a learning rate of 0.01, and Table 3 shows the error values of LSTM Adam with a learning rate of 0.001. Comparing Table 2 and Table 3, this study found that for BBRI, BBNI, and BMRI stocks LSTM Adam produced a smaller error, while for BBTN stock LSTM SGD produced a smaller error. The number of epochs also needs to be considered because it affects the computation time. The next step is to display the plots of the actual and predicted values of LSTM SGD (Figure 9). The plots show prediction values that are close to the actual data and produce the minimum error. LSTM Adam requires 1000 epochs to reach the minimum error of 2.13 on BBRI shares. LSTM Adam produces a minimum error of 9.32 on BBNI shares and requires 50 epochs. LSTM Adam with 100 epochs produces a minimum error of 22.87 on BMRI shares, and LSTM SGD produces a minimum error of 2.59 on BBTN shares at 1000 epochs.
LSTM SGD and LSTM Adam have been trained successfully. The next step is to make predictions for the next ten days using five historical data points and the model that produces the minimum error for each stock: LSTM Adam for BBRI, BBNI, and BMRI shares, and LSTM SGD for BBTN shares.
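Forming a time series pattern from the normalized prices could look like the sketch below, which turns a series into pairs of an input window and the value that follows it. The window length of five mirrors the five historical data points used for prediction; using the same length for training is an assumption, and the function name and sample values are hypothetical.

```python
def make_windows(series, window=5):
    """Split a series into (input window, next value) training pairs."""
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])  # `window` consecutive values
        targets.append(series[start + window])       # the value right after them
    return inputs, targets


normalized = [0.10, 0.25, 0.20, 0.40, 0.35, 0.50, 0.45, 0.60]  # made-up values
inputs, targets = make_windows(normalized)
for x, y in zip(inputs, targets):
    print(x, "->", y)
```

Each window is fed through the LSTM one value at a time, and the corresponding target is compared with the LSTM output to compute the loss.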
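One common way to predict several days ahead from five historical values is to feed each prediction back in as the newest input. The text does not state whether this recursive scheme is the one used here, so the sketch below is an assumption; `forecast` and `mean_predictor` are hypothetical names, and `mean_predictor` is only a stand-in for a trained LSTM.

```python
def forecast(history, predict_next, horizon=10, window=5):
    """Predict `horizon` future values, reusing each prediction as new input."""
    if len(history) < window:
        raise ValueError("history must contain at least `window` values")
    values = list(history)
    predictions = []
    for _ in range(horizon):
        next_value = predict_next(values[-window:])  # use the latest five values
        predictions.append(next_value)
        values.append(next_value)                    # feed the prediction back in
    return predictions


def mean_predictor(window_values):
    """Stand-in for a trained model: predicts the mean of the window."""
    return sum(window_values) / len(window_values)


history = [4000.0, 4100.0, 4050.0, 4200.0, 4150.0]   # hypothetical closing prices
print(forecast(history, mean_predictor))
```

With this scheme, errors in early predictions carry over into later ones, so the uncertainty of the forecast generally grows with the horizon.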

CONCLUSIONS
Based on the results of the study, it can be concluded that LSTM SGD and LSTM Adam can be trained well. LSTM Adam with a learning rate of 0.001 produces a smaller error value in training. Both LSTM SGD and LSTM Adam also succeeded in predicting stock prices for the next ten days using five historical data points.
Suggestions for further research are to test various time series data patterns, in order to find out how long a time series can be used in LSTM while still preventing the vanishing gradient problem, and to experiment with variants of the LSTM architecture, such as Stacked-LSTM.

Figure 1. Illustration of Data Before Normalization (blue) and After Normalization (red)

Figure 3. Forget Gate LSTM
The forget gate of the LSTM is calculated using equation (2).

Figure 4. Input Gate LSTM
The two input gate processes are calculated using equations (3) and (4).

Figure 5. Cell State LSTM
The cell state update is performed using equation (5).

Figure 7. Forward Pass LSTM to Generate Output

Figure 8. Backward Pass LSTM to Update the Weights

Figure 9. Plot of Actual and Predicted Values from LSTM SGD Training: (a) BBRI, (b) BBNI, (c) BMRI, and (d) BBTN
Figure 10 shows the plot of the actual and predicted values of LSTM Adam with a learning rate of 0.001.

Figure 11. Prediction Plots: (a) BBRI by LSTM Adam, (b) BBNI by LSTM Adam, (c) BMRI by LSTM Adam, and (d) BBTN by LSTM SGD
Plot 11(a) shows stock predictions in an uptrend, plots 11(b) and 11(c) show stock predictions in a downtrend and a stagnant trend, and plot 11(d) shows stock predictions in an uptrend for the next ten days.
Calculating the value of the loss derivative with respect to the input weight of the forget gate, input gate, candidate cell (c̃), and output gate at t < T.

Table 1. Data Used in This Research

Table 2. Error of LSTM SGD with Learning Rate 0.01

Table 3. Error of LSTM Adam with Learning Rate 0.001

Table 4. Predictions for the Next Ten Days with LSTM SGD and LSTM Adam
Figure 11 shows a prediction plot for the next ten days.