Daniel Green

Forecasting Time Series: ARIMA(p, d, q) Model

Published at: Mon Dec 22 2025

Assumed Knowledge

In this article, I assume you have read the previous articles:

In previous articles I explain the basics of what is needed to understand this article, but because I try to show my derivations of things as clearly as possible, there may be a lot of excess. I also introduce quite a few new concepts in this series about notation for time series mathematics.

What is an ARIMA(p, d, q) Model?

ARIMA(p, d, q) stands for Autoregressive Integrated Moving-Average model of order p, d, q. You can see the similarities to the ARMA(p, q) model by the name, but there is an added “Integrated” part. This part represents the “differencing” operation and the need to integrate the predicted difference to get to the actual prediction, where you basically calculate the difference between the current value and the previous value, for all values in the dataset, then you do this d times, using the new values each time. After you’ve differenced the series d times until it is weakly-stationary, you then fit an ARMA(p, q) model to the data. This has a major advantage over the ARMA(p, q) model, and previously mentioned models, as it allows you to fit a model to non-stationary data with the limitation being that the data must have a constant variance. So you can describe this model being able to be fit on data that is non-stationary and homoscedastic. Homoscedastic sounds complex but really just means that the variance is constant, as opposed to heteroscedastic, meaning a changing variance with time. This will be important in later articles.

Defining this model mathematically is a bit more complex compared with the previously mentioned models, as the data must first be differenced. I will now describe it in as much detail as I can:

zt={Z2,Z3,...,ZT}zt1={Z1,Z2,...,ZT1}zt=ztzt1=Δ1zt\begin{gather*} z_t=\{Z_2,Z_3,...,Z_T\} \\ z_{t-1}=\{Z_1,Z_2,...,Z_{T-1}\} \\ z'_t=z_t-z_{t-1} = \Delta^1z_t \end{gather*}

Here I define the actual time series, with the second time series just being the first shifted to the left by 1. I also describe the new time series that is 1st order differenced. This also allows me to define a more general term for differencing of order d:

Δdzt=Δd1ztΔd1zt1Δdzt=Δd1ztΔd1zt1\Delta^{d}z_t=\Delta^{d-1}z_t-\Delta^{d-1}z_{t-1}\Delta^{d}z_t=\Delta^{d-1}z_t-\Delta^{d-1}z_{t-1}

Now, a simple ARMA(p, q) model can be defined using our differenced time series of order d:

Δdz(t)=γ+i=1pβiΔdz(ti)+i=1qθiϵti+ϵt\Delta^dz(t)=\gamma+\sum^{p}_{i=1}{\beta_i\Delta^dz(t-i)}+\sum^{q}_{i=1}\theta_i\epsilon_{t-i}+\epsilon_t

How do we actually get to a prediction?

Now that I have defined the ARIMA(p, d, q) model in terms of its differenced form of order d, you may be wondering how to turn a differenced form into an actual prediction of the next value.

This can be done by simply rearranging whatever our difference equation was, so for d = 1 you get:

zt=Δ1zt+zt1z_t=\Delta^1z_t+z_{t-1}

For d = 2, you get this rearrangement:

Δ1zt=ztzt1Δ1zt1=zt1zt2Δ2zt=Δ1ztΔ1zt1Δ2zt=(ztzt1)(zt1zt2)Δ2zt=zt2zt1+zt2zt=Δ2zt+2zt1zt2\begin{align*} &\Delta^1z_t=z_t-z_{t-1} \\ &\Delta^1z_{t-1}=z_{t-1}-z_{t-2} \\ &\Delta^2z_t=\Delta^1z_t-\Delta^1z_{t-1} \\ &\Delta^2z_t=(z_t-z_{t-1})-(z_{t-1}-z_{t-2}) \\ &\Delta^2z_t=z_t-2z_{t-1}+z_{t-2} \\ &z_t=\Delta^2z_t+2z_{t-1}-z_{t-2} \end{align*}

So you can see that to find the next prediction of ztz_t, you need to know the previous 2 values. Let’s also try this for d = 3:

Δ2zt=zt2zt1+zt2Δ2zt1=zt12zt2+zt3Δ3zt=Δ2ztΔ2zt1Δ3zt=(zt2zt1+zt2)(zt12zt2+zt3)Δ3zt=zt3zt1+3zt2zt3zt=Δ3ztzt+3zt13zt2+zt3\begin{align*} &\Delta^2z_t=z_t-2z_{t-1}+z_{t-2} \\ &\Delta^2z_{t-1}=z_{t-1}-2z_{t-2}+z_{t-3} \\ &\Delta^3z_t=\Delta^2z_t-\Delta^2z_{t-1} \\ &\Delta^3z_t=(z_t-2z_{t-1}+z_{t-2})-(z_{t-1}-2z_{t-2}+z_{t-3}) \\ &\Delta^3z_t=z_t-3z_{t-1}+3z_{t-2}-z_{t-3} \\ &z_t=\Delta^3z_t-z_t+3z_{t-1}-3z_{t-2}+z_{t-3} \end{align*}

You can see that you need to know d past time series values to predict the next, when using this model. You may also notice that our expansions coefficients resemble pascals triangle:

Pascals Triangle

This leads us to the following equation for d = 3:

Δ3zt=(30)zt(31)zt1+(32)zt2(33)zt3\begin{align*} \Delta^3z_t=\begin{pmatrix} 3 \\ 0 \end{pmatrix}z_t-\begin{pmatrix} 3 \\ 1 \end{pmatrix}z_{t-1}+\begin{pmatrix} 3 \\ 2 \end{pmatrix}z_{t-2}-\begin{pmatrix} 3 \\ 3 \end{pmatrix}z_{t-3} \end{align*}

There is an oscillation between positive and negative terms, which I account for here:

Δ3zt=(1)0(30)zt+(1)1(31)zt1+(1)2(32)zt+(1)3(33)zt3\begin{align*} \Delta^3z_t=(-1)^0\begin{pmatrix} 3 \\ 0 \end{pmatrix}z_t+(-1)^1\begin{pmatrix} 3 \\ 1 \end{pmatrix}z_{t-1}+(-1)^2\begin{pmatrix} 3 \\ 2 \end{pmatrix}z_{t}+(-1)^3\begin{pmatrix} 3 \\ 3 \end{pmatrix}z_{t-3} \end{align*}

You can now see how this can be turned into a generalized formula, which I do here:

Δdzt=(1)0(d0)zt+(1)1(d1)zt1+...+(1)d(dd)ztdΔdzt=i=0d(1)i(di)zti\begin{gather*} \Delta^dz_t=(-1)^0\begin{pmatrix} d \\ 0 \end{pmatrix}z_t+(-1)^1\begin{pmatrix} d \\ 1 \end{pmatrix}z_{t-1}+...+(-1)^d\begin{pmatrix} d \\ d \end{pmatrix}z_{t-d} \\ \Delta^dz_t=\sum^{d}_{i=0}{(-1)^i\begin{pmatrix} d \\ i \end{pmatrix}z_{t-i}} \end{gather*}

This is about as simple as time series integration can get without introducing new notation, which I’m going to do now. To make this a bit more readable and make writing out further time series equations more simple, I introduce the lag operator, L. This has the following properties:

Lzt=zt1L2zt=zt2Ldzt=ztdLz_t=z_{t-1} \\ L^2z_t=z_{t-2} \\ L^dz_t=z_{t-d}

So you can see how it works, and this means I can turn our rather lengthy expression for the integration into something more succinct:

Δdzt=(1L)dzt\Delta^dz_t=(1-L)^dz_t

So you can rearrange this for ztz_t quite easily:

zt=Δdzt(1L)dz_t=\Delta^dz_t(1-L)^{-d}

But we can also use our previous definition using a binomial expansion to make a solution that you can easily turn into an algorithm to solve it:

Δdzt=zt+i=1d(1)i(di)ztizt=Δdzti=1d(1)i(di)zti\Delta^dz_t=z_t+\sum^{d}_{i=1}{(-1)^i\begin{pmatrix} d \\ i \end{pmatrix}z_{t-i}} \\ z_t=\Delta^dz_t-\sum^{d}_{i=1}{(-1)^i\begin{pmatrix} d \\ i \end{pmatrix}z_{t-i}}

How do we fit an ARIMA(p, d, q) model?

Just like the previous models, to fit this model, we must first find what p, d, and q should be set to. Naturally, we find d first as the data must be stationary to fit the ARMA(p, q) model to it. This can be done by just continually differencing our data until it is stationary, which we can check using an Augmented Dickey-Fuller test.

Generating some ARIMA(p, d, q) data

Now to generate some data that has an ARIMA(p, d, q) model. For this, I have picked a value of d = 1, as this is a simpler thing to implement. And I will pick a value of p = 2, and q = 1. Now picking our two coefficients for the autoregressive part requires us to abide by the following constraint:

0=1β1zβ2z2β2z2+β1z1=00=1-\beta_1z-\beta_2z^2 \\ \beta_2z^2+\beta_1z-1=0

Where the roots of this must be under the following constraint:

zi>1|z_i|>1

I wrote the following python code to generate some potential values of β1\beta_1 and β2\beta_2, which is below:

import numpy as np

beta_1s = np.linspace(0.1, 1.5, 15)
beta_2s = np.linspace(0.1, 1.5, 15)
c = -1.0

for i in range(len(beta_1s)):
    for j in range(len(beta_2s)):
        beta_1 = beta_1s[i]
        beta_2 = beta_2s[j]
        roots = np.roots([beta_2, beta_1, c])
        all_outside_unit_circle = np.all(np.abs(roots) > 1.0)
        if all_outside_unit_circle:
            print(f"Found one with beta_1 = {round(beta_1, 2)} and beta_2 = {round(beta_2, 2)}")

From this, I chose to pick the following values: β1=0.3,β2=0.5\beta_1=0.3,\beta_2=0.5

First, I create some ARMA(2,1) data using the following code:

import numpy as np
import matplotlib.pyplot as plt

N = 1000
np.random.seed(4)
norm = np.random.normal(0, 1, N)
x = np.linspace(0, N)
arma_data = np.zeros(N) + norm
beta_1 = 0.3
beta_2 = 0.5
theta_1 = -0.5
for i in range(2, N):
    arma_data[i] = beta_1 * arma_data[i-1] + beta_2 * arma_data[i-2] + theta_1 * norm[i-1] + norm[i]

plt.plot(arma_data)
plt.title("ARMA(2,1) with $\\beta_1 = 0.3,\\beta_2 = 0.5$ and $\\theta_1 = -0.5$")
plt.xlabel("Timestep")
plt.show()

Which produces the following plot:

ARMA(2,1)

Now to turn this into ARIMA(2,1,1) data, we must define an initial value, which I will set to 20 and then create the rest of the data from that. I do this with the following code:

data = np.zeros(N)
data[0] = 20
for i in range(1, len(data)):
    data[i] = arma_data[i] + data[i-1]

plt.plot(data)
plt.title("ARIMA(2,1,1) with $\\beta_1 = 0.3,\\beta_2 = 0.5$ and $\\theta_1 = -0.5$")
plt.xlabel("Timestep")
plt.show()

This produces the following plot:

ARIMA(2,1,1)

Fitting to our data

To fit to our data, we will first find out what order of d to use. This can be done by using the following code:

from statsmodels.tsa.stattools import adfuller

def difference(data: np.ndarray, order: int):
    if order == 0:
        return data
    differenced = np.zeros(len(data) - 1)
    for i in range(0, len(data)-1):
        differenced[i] = data[i+1] - data[i]
    return difference(differenced, order-1)

differenced_data = data

for d in range(1, 5):
    differenced_data = difference(data, d)
    result = adfuller(differenced_data)
    value = result[0]
    one_percent_critical = result[-2]['1%'] # type: ignore
    if value < one_percent_critical:
        print(f"The order d = {d}")
        break

This obviously tells us to use the first order for d, as we’d expect. I now plot the autocorrelation and partial autocorrelation functions, to see if we can get any insight on what the orders for p and q might be by just looking at the data. This results in the following plots:

An ImageAn Image

Which tells us a little about the data, such as the fact that p is most likely order of 2, but could be as high as 4. When there are more orders to the generated data, the PACF plot interpretation usually involves ignoring small values, even if they are larger than the 5% significance level. And the autocorrelation graph is seemingly all over the place, but seems to stop having any significance at a lag of 5. So we will have to fit this model in the same way the last one was fit, which is done with the following code:

from statsmodels.tsa.arima.model import ARIMA

train_data = data[:int(N*0.9)]
test_data = data[int(-N*0.1):]

min_bic = np.inf
min_fit = None
for p in range(1, 5):
    for q in range(1, 6):
        model = ARIMA(train_data, order=(p, 1, q))
        model_fit = model.fit()
        if model_fit.bic < min_bic:
            min_bic = model_fit.bic
            min_fit = model_fit

if min_fit != None:
    print(min_fit.summary())

This produces the following output:

                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                  900
Model:                 ARIMA(2, 1, 1)   Log Likelihood                -616.906
Date:                Mon, 22 Dec 2025   AIC                           1241.811
Time:                        03:33:09   BIC                           1261.016
Sample:                             0   HQIC                          1249.148
                                - 900                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.3235      0.049      6.584      0.000       0.227       0.420
ar.L2          0.4817      0.031     15.656      0.000       0.421       0.542
ma.L1         -0.5208      0.052     -9.998      0.000      -0.623      -0.419
sigma2         0.2308      0.011     20.253      0.000       0.208       0.253
===================================================================================
Ljung-Box (L1) (Q):                   0.01   Jarque-Bera (JB):                 0.84
Prob(Q):                              0.93   Prob(JB):                         0.66
Heteroskedasticity (H):               0.87   Skew:                             0.01
Prob(H) (two-sided):                  0.24   Kurtosis:                         2.85
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

You can see that the fit coefficients are quite close to the ones we used to generate data.

Predicting the test data values

To use our model to predict the values for the test data, I use the following code:

forecast_result = model_fit.get_forecast(steps=int(N*0.1))
forecast_values = forecast_result.predicted_mean
confidence_intervals = forecast_result.conf_int()

# Plotting the results
plt.figure(figsize=(10, 5))
plt.plot(test_data, label="Actual (Test Data)")
plt.plot(forecast_values, label="Forecast", color="red")
plt.fill_between(
    range(len(test_data)),
    confidence_intervals[:, 0],
    confidence_intervals[:, 1],
    color="pink",
    alpha=0.3,
    label="95% Confidence Interval"
)
plt.title("ARIMA(2,1,1) Forecast vs Actual")
plt.legend()
plt.show()

This produces this plot:

ARIMA(2,1,1) Forecast

Here you can see, that none of the values are outside out 95% confidence interval, and that unlike the other models that had an essentially constant error interval, this models error interval grows over time. This is to be expected, as this model allows for the mean to change with time, meaning we can do trend prediction with this model; as long as the data is homoscedastic.