Prerequisites : https://coeai19.wixsite.com/coe-ai-bbsr/post/time-series-and-forecasting?postId=1fe2a721-b93c-4e86-97a5-b060f225f14d
By now, we should have a basic understanding of what a Time Series Dataset is and how to use it. The two main goals of Time Series Analysis are mentioned below.
Recognizing Data Patterns
Using the Insights to Make Predictions
Identifying data patterns entails figuring out how the variables are associated; that knowledge is then used to forecast future values.
Models that are most appropriate for the Time Series
In Python and R, there are various Data Science and Machine Learning libraries that include several models for solving Time Series Analysis problems.
The following are some of the most popular Time Series Analysis models:
ARIMA Models
SARIMA Models
Box-Jenkins Multivariate Models
Holt-Winters Exponential Smoothing
Unobserved Components Model
ARIMA model
ACF and PACF
In the autoregression model, we tried to find the relationship between the current value of the variable and its historical values. But one question remained: how do we know how many lag values, or historical values, we should use?
This is where autocorrelation and partial autocorrelation will help us.
Correlation between two variables is a measure of the relationship between them. It is measured by Pearson's correlation coefficient, which ranges from minus one to plus one. If the value is plus one, the relationship between the two variables is perfectly positive: if X increases, Y also increases, and if X decreases, Y also decreases.
At the other extreme, the value is minus one. This means that if X increases, Y decreases, and if X decreases, Y increases. The two variables move in opposite directions but are still correlated; the correlation is negative.
When the correlation coefficient is zero, there is no linear relationship between the two variables. So if X increases, we cannot say what will happen to Y: it may increase or it may decrease.
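As a minimal sketch of the three cases just described, the coefficient can be computed by hand. The helper name pearson and the sample values are illustrative assumptions, not part of the original post.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Illustrative values only
x = [1, 2, 3, 4, 5]
same_direction = [2, 4, 6, 8, 10]      # rises whenever x rises
opposite_direction = [10, 8, 6, 4, 2]  # falls whenever x rises
no_linear_link = [1, -1, 0, -1, 1]     # no linear trend with x

r_pos = pearson(x, same_direction)
r_neg = pearson(x, opposite_direction)
r_zero = pearson(x, no_linear_link)
print(r_pos, r_neg, r_zero)
```

The three results land at the two extremes and the midpoint of the range discussed above.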
With a time series, however, we are looking for the correlation of the variable with its own lag values. That is why, as with autoregression, this correlation is called autocorrelation: correlation with itself, that is, with its own lag values.
To use it, we first find the correlation with all the lag values.
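A rough sketch of that calculation, assuming the usual sample autocorrelation estimator (deviations from the overall mean, divided by the total sum of squares). The function name and the series are hypothetical, chosen only for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom

# Hypothetical series with a steady upward trend (illustrative values only)
series = [10, 12, 13, 15, 18, 20, 21, 24, 26, 29]

acf_values = [autocorrelation(series, k) for k in range(4)]
for k, r in enumerate(acf_values):
    print(f"lag {k}: {r:.3f}")
```

For a trending series like this one, the autocorrelation is one at lag zero and shrinks as the lag grows.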
ACF (Autocorrelation Function) Plot : The lag values are on the x axis. The first line is at zero, that is, it is for lag zero. The second line is for lag one, and so on. On the y axis, we have the correlation coefficient. With the lag-zero value, that is, with itself, the correlation coefficient is plus one, which is obvious. With the lag-one value, it is nearly 0.8.
The colored cone at the bottom is the 95 percent confidence interval cone.
Basically, a point outside this cone means that we can be about 95 percent confident that the correlation at that lag is real and not a chance result.
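To make the idea of the cone concrete, here is a small sketch using a common large-sample approximation: for a series with no real autocorrelation, the sample autocorrelations fall within about 1.96 / sqrt(n) of zero 95 percent of the time. This is an assumption for illustration; plotting libraries may draw a refined band that widens with the lag. The function names and the values are hypothetical.

```python
from math import sqrt

def significance_bound(n_observations, z=1.96):
    """Approximate 95% bound for the autocorrelation of an uncorrelated series."""
    return z / sqrt(n_observations)

def significant_lags(acf_values, n_observations):
    """Lags (excluding lag 0) whose autocorrelation falls outside the band."""
    bound = significance_bound(n_observations)
    return [lag for lag, r in enumerate(acf_values) if lag > 0 and abs(r) > bound]

# Hypothetical autocorrelation values for lags 0..4 of a 36-point series
acf_values = [1.0, 0.8, 0.45, 0.30, 0.10]

bound = significance_bound(36)
lags = significant_lags(acf_values, 36)
print(round(bound, 3), lags)
```

Only the lags whose bars extend past the bound would be treated as candidates when choosing how many lag values to use.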
When we find the correlation between the original series and its lag-three values, we should remove the effect of the lag-one and lag-two values. When the effects of these intervening observations are removed, the correlation coefficient is called the partial autocorrelation coefficient.
PACF (Partial Autocorrelation Function) Plot
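One standard way to remove the effect of the intervening lags is the Durbin-Levinson recursion, which turns autocorrelations into partial autocorrelations. The sketch below is an illustration under that assumption, not necessarily what the plotting function does internally; the function name is hypothetical. The test input is the theoretical autocorrelation of a series that depends only on its previous value (0.8 raised to the lag).

```python
def pacf_from_acf(acf):
    """Partial autocorrelations for lags 1..len(acf)-1 (Durbin-Levinson recursion).

    acf[k] is the autocorrelation at lag k, with acf[0] == 1.
    """
    pacf = []
    prev = []  # prev[j] holds the coefficient on lag j+1 from the previous order
    for k in range(1, len(acf)):
        num = acf[k] - sum(prev[j] * acf[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * acf[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf

# Autocorrelations that decay as 0.8 ** lag (lags 0..4)
acf = [0.8 ** k for k in range(5)]
pacf = pacf_from_acf(acf)
print([round(p, 6) for p in pacf])
```

Although the autocorrelation stays well above zero at lags two to four, the partial autocorrelation is non-zero only at lag one, which is the kind of cut-off used to pick the number of lag values.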
ARIMA (AutoRegressive Integrated Moving Average) Model
Steps :
Differencing method is used to remove trends
Auto regression is used on the new series to find the initial set of forecasts
Moving average model is applied on residuals to update the forecast
In the end, the differencing is reversed, that is, the lag values are added back to the forecast.
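The steps above can be sketched in a few lines. This is a simplified illustration, not the library's implementation: it uses one round of differencing and a one-lag autoregression fitted by least squares, and it leaves out the moving-average correction on the residuals. The function names and the series are hypothetical.

```python
def difference(series):
    """First difference: y[t] - y[t-1]."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

def fit_ar1(series):
    """Least-squares fit of z[t] = c + phi * z[t-1]; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    return mean_y - phi * mean_x, phi

# Hypothetical series: an upward trend with an alternating step size
series = [10.0, 13.0, 14.0, 17.0, 18.0, 21.0, 22.0, 25.0]

diffs = difference(series)          # step 1: remove the trend
c, phi = fit_ar1(diffs)             # step 2: autoregression on the new series
next_diff = c + phi * diffs[-1]     # forecast of the next difference
forecast = series[-1] + next_diff   # step 4: add back the last observed value
print(diffs, forecast)
```

The differences alternate between 3 and 1, the autoregression picks up that pattern, and adding the predicted difference back to the last observation gives the forecast on the original scale.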
Parameters :
AR - p - Order of autoregression (how many lag variables to use)
I - d - Order of integration (number of times differencing is applied)
MA - q - Order of moving average (how many lagged forecast errors to use)
ARIMA model in Python :
Month Sales
1-01 266.0
1-02 145.9
1-03 183.1
1-04 119.3
1-05 180.3
df['Sales'].plot()
The trend is polynomial and there is no seasonality, so we difference twice:
d = 2
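To see why a polynomial trend can call for differencing more than once, here is a small sketch with a purely quadratic series. The series and the helper name are hypothetical and chosen only for illustration.

```python
def difference(series, order=1):
    """Apply first differencing `order` times."""
    for _ in range(order):
        series = [series[t] - series[t - 1] for t in range(1, len(series))]
    return series

# Hypothetical series with a purely quadratic trend: y = t^2 + 3t + 5
quadratic = [t * t + 3 * t + 5 for t in range(8)]

first = difference(quadratic, order=1)
second = difference(quadratic, order=2)
print(first)   # still increasing, so a trend remains
print(second)  # constant, so the trend is gone
```

One round of differencing turns the quadratic trend into a linear one; the second round removes it entirely.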
Autocorrelation Plot
from pandas.plotting import autocorrelation_plot
autocorrelation_plot(df['Sales'])
Partial Autocorrelation Graph
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(df['Sales'], lags=15)
p = 2
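# Legacy API that produced the output shown here; it was removed in statsmodels 0.13.
# Newer versions provide statsmodels.tsa.arima.model.ARIMA, whose forecast() returns only the predicted values.
from statsmodels.tsa.arima_model import ARIMA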
model = ARIMA(df['Sales'], order=(5,2,2))
model_fit = model.fit()
model_fit.summary()
ARIMA Model Results
Dep. Variable: D2.Sales No. Observations: 34
Model: ARIMA(5, 2, 2) Log Likelihood -189.034
Method: css-mle S.D. of innovations 54.343
Date: Tue, 03 Mar 2020 AIC 396.067
Time: 17:25:28 BIC 409.805
Sample: 2 HQIC 400.752
coef std err z P>|z| [0.025 0.975]
const 0.9167 0.256 3.574 0.001 0.414 1.419
ar.L1.D2.Sales -2.1873 0.184 -11.887 0.000 -2.548 -1.827
ar.L2.D2.Sales -2.1231 0.405 -5.246 0.000 -2.916 -1.330
ar.L3.D2.Sales -1.6013 0.478 -3.353 0.002 -2.537 -0.665
ar.L4.D2.Sales -1.0317 0.409 -2.520 0.018 -1.834 -0.229
ar.L5.D2.Sales -0.3193 0.193 -1.653 0.110 -0.698 0.059
ma.L1.D2.Sales -0.0002 0.120 -0.002 0.998 -0.235 0.234
ma.L2.D2.Sales -0.9998 0.120 -8.363 0.000 -1.234 -0.765
Roots
Real Imaginary Modulus Frequency
AR.1 0.1409 -1.3030j 1.3106 -0.2329
AR.2 0.1409 +1.3030j 1.3106 0.2329
AR.3 -1.1313 -0.4225j 1.2076 -0.4431
AR.4 -1.1313 +0.4225j 1.2076 0.4431
AR.5 -1.2502 -0.0000j 1.2502 -0.5000
MA.1 1.0000 +0.0000j 1.0000 0.0000
MA.2 -1.0002 +0.0000j 1.0002 0.5000
residuals = model_fit.resid
residuals.plot()
residuals.describe()
count 34.000000
mean 11.293855
std 65.989793
min -119.295696
25% -31.207155
50% 12.481503
75% 55.127849
max 156.383323
dtype: float64
Some Variations of ARIMA model
ARIMA - model = ARIMA(df['Sales'], order=(p,d,q))
Autoregression - model = ARIMA(df['Sales'], order=(p,d,0))
Moving Average Model - model = ARIMA(df['Sales'], order=(0,d,q))
output = model_fit.forecast()
output
(array([636.15148334]), array([54.34286347]), array([[529.64142812, 742.66153855]]))
The output contains the forecasted value, its standard error, and the 95% confidence interval.
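As a quick check of how these three pieces fit together, the interval can be rebuilt from the first two numbers, assuming the usual normal-based interval (forecast plus or minus about 1.96 standard errors). The variable names are illustrative; the two input numbers are copied from the output above.

```python
from statistics import NormalDist

forecast = 636.15148334   # forecasted value from the output above
std_error = 54.34286347   # its standard error from the output above

# Two-sided 95% interval: forecast +/- z * standard error, with z about 1.96
z = NormalDist().inv_cdf(0.975)
lower = forecast - z * std_error
upper = forecast + z * std_error
print(round(lower, 4), round(upper, 4))
```

The two bounds agree with the interval in the forecast output to within rounding.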
model_fit.forecast(5)[0]
array([636.15148334, 709.19180983, 664.3330911 , 771.11928552, 761.60381712])
Walk Forward Validation of ARIMA Model
Since training statistical models is not time consuming, walk-forward validation is the preferred way to get the most accurate results.
train_size = int(df.shape[0]*0.7)
train, test = df.Sales[0:train_size], df.Sales[train_size:]
test.shape
(11,)
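import numpy as np
import pandas as pd

data = train
predict = []
for t in test:
    # Refit on all observations seen so far, then forecast one step ahead
    model = ARIMA(data, order=(5,1,0))
    model_fit = model.fit()
    y = model_fit.forecast()
    print(y[0][0])
    predict.append(y[0][0])
    # Add the true observation to the history before the next step
    data = np.append(data, t)
    data = pd.Series(data)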
387.3764645395876
348.1541436834551
386.30880112815987
356.0820881964668
446.3794710820297
394.73728843470417
434.9154133760461
507.9234715144021
435.48276116299513
652.7439008036883
546.3434721834466
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(test.values, predict)
mse
8119.124448295092
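The same walk-forward pattern works with any forecaster. Below is a minimal, library-free sketch that uses a naive "repeat the last value" baseline in place of ARIMA; the function names and the data are hypothetical and only meant to show the mechanics.

```python
def walk_forward(train, test, forecast_fn):
    """One-step-ahead walk-forward validation.

    forecast_fn(history) must return a forecast for the next point.
    """
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(forecast_fn(history))  # forecast before seeing the value
        history.append(actual)                    # then reveal the true value
    return predictions

def mse(actual, predicted):
    """Mean squared error of the predictions."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def persistence(history):
    """Naive baseline: the next value equals the last observed one."""
    return history[-1]

# Hypothetical data (illustrative values only)
train, test = [10, 12, 11, 13], [14, 13, 15]

predictions = walk_forward(train, test, persistence)
score = mse(test, predictions)
print(predictions, score)
```

A baseline like this gives a reference error: a fitted model is only worth keeping if its walk-forward error is clearly lower.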
SARIMA model
ARIMA models cannot handle seasonality in the data. ARIMA is suited to data with only a trend, while SARIMA (Seasonal ARIMA) handles both trend and seasonality.
Parameters :
SARIMA(p,d,q)(P,D,Q)m
p : Trend autoregression order
d : Trend difference order
q : Trend moving average order
P : Seasonal autoregressive order
D : Seasonal difference order
Q : Seasonal moving average order
m : the number of time steps for a single seasonal period
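To illustrate what the seasonal difference order D and the period m do, here is a small sketch of seasonal differencing, where each value is compared with the value one full season earlier. The series and the function name are hypothetical.

```python
def seasonal_difference(series, m):
    """Seasonal difference: y[t] - y[t-m], where m is the season length."""
    return [series[t] - series[t - m] for t in range(m, len(series))]

# Hypothetical quarterly series (m = 4): a repeating pattern that grows by 2 each year
pattern = [5, 9, 14, 7]
series = [pattern[t % 4] + 2 * (t // 4) for t in range(12)]

deseasonalized = seasonal_difference(series, m=4)
print(series)
print(deseasonalized)
```

The repeating within-year pattern cancels out and only the year-over-year change remains, which is what one round of seasonal differencing (D = 1) is meant to achieve.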
An extension of the SARIMA model is the SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors) model. Here we also need to input other variables, known as exogenous variables, in addition to the variable to be predicted.
For example, to predict stock prices, apart from the historical closing prices, we can also input variables like volume traded, opening price, daily high, etc. We just need to make sure that the datetime is kept as the index, and not as a separate column.
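A minimal sketch of preparing such data with pandas, assuming a hypothetical table with Date, Close and Volume columns (the names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical stock data; column names and values are illustrative only
raw = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "Close": [101.2, 102.5, 101.9],
    "Volume": [15000, 18200, 16100],
})

# Parse the date strings and move them from a column into the index
raw["Date"] = pd.to_datetime(raw["Date"])
df = raw.set_index("Date")

endog = df["Close"]     # the variable to be predicted
exog = df[["Volume"]]   # exogenous variable(s), aligned on the same index
print(df.index)
```

With the data in this shape, the two pieces could be passed to the model as SARIMAX(endog, exog=exog, ...), alongside the order and seasonal_order arguments.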
Implementation in Python
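from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.seasonal import seasonal_decompose  # used for the decomposition below
from matplotlib import pyplot  # used for the final plot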
df.head()
Month MilesMM
0 1963-01-01 6827
1 1963-02-01 6178
2 1963-03-01 7084
3 1963-04-01 8162
4 1963-05-01 8462
df.index = df['Month']
result_a = seasonal_decompose(df['MilesMM'], model='multiplicative')
result_a.plot()
model = SARIMAX(df['MilesMM'], order=(5,1,3), seasonal_order=(1,1,1,12))
model_fit = model.fit()
residuals = model_fit.resid
residuals.plot()
output = model_fit.forecast()
output
1971-01-01 11475.842358
Freq: MS, dtype: float64
model_fit.forecast(12)
1971-01-01 11475.842358
1971-02-01 11153.512800
1971-03-01 13669.497445
1971-04-01 12647.357108
1971-05-01 14338.979952
1971-06-01 15786.326933
1971-07-01 14979.147877
1971-08-01 15362.201531
1971-09-01 16962.826726
1971-10-01 13682.072572
1971-11-01 12426.861771
1971-12-01 13730.089150
Freq: MS, dtype: float64
yhat = model_fit.predict()
yhat.head()
1963-01-01 0.000000
1963-02-01 5871.999261
1963-03-01 5422.112669
1963-04-01 7122.615626
1963-05-01 7067.315947
Freq: MS, dtype: float64
pyplot.plot(df['MilesMM'])
pyplot.plot(yhat, color='red')
Final thoughts
Time Series Analysis is one of the most common problems in data analysis. A number of models and methods can be used to solve Time Series Analysis problems effectively. Time series forecasting is used in a variety of real-world applications, including:
Forecasting the Economy
Forecasting Sales and Marketing
Yield Estimation
Seismological Predictions
Military Planning