Assumed Knowledge
In this article, I assume you have read the previous articles:
- Forecasting Time Series: AR(p) Model (opens in a new tab)
- Forecasting Time Series: MA(q) Model (opens in a new tab)
- Forecasting Time Series: ARMA(p, q) Model (opens in a new tab)
In previous articles I explain the basics of what is needed to understand this article, but because I try to show my derivations of things as clearly as possible, there may be a lot of excess. I also introduce quite a few new concepts in this series about notation for time series mathematics.
What is an ARIMA(p, d, q) Model?
ARIMA(p, d, q) stands for Autoregressive Integrated Moving-Average model of order p, d, q. You can see the similarities to the ARMA(p, q) model by the name, but there is an added “Integrated” part. This part represents the “differencing” operation and the need to integrate the predicted difference to get to the actual prediction, where you basically calculate the difference between the current value and the previous value, for all values in the dataset, then you do this d times, using the new values each time. After you’ve differenced the series d times until it is weakly-stationary, you then fit an ARMA(p, q) model to the data. This has a major advantage over the ARMA(p, q) model, and previously mentioned models, as it allows you to fit a model to non-stationary data with the limitation being that the data must have a constant variance. So you can describe this model being able to be fit on data that is non-stationary and homoscedastic. Homoscedastic sounds complex but really just means that the variance is constant, as opposed to heteroscedastic, meaning a changing variance with time. This will be important in later articles.
Defining this model mathematically is a bit more complex compared with the previously mentioned models, as the data must first be differenced. I will now describe it in as much detail as I can:
Here I define the actual time series, with the second time series just being the first shifted to the left by 1. I also describe the new time series that is 1st order differenced. This also allows me to define a more general term for differencing of order d:
Now, a simple ARMA(p, q) model can be defined using our differenced time series of order d:
How do we actually get to a prediction?
Now that I have defined the ARIMA(p, d, q) model in terms of its differenced form of order d, you may be wondering how to turn a differenced form into an actual prediction of the next value.
This can be done by simply rearranging whatever our difference equation was, so for d = 1 you get:
For d = 2, you get this rearrangement:
So you can see that to find the next prediction of , you need to know the previous 2 values. Let’s also try this for d = 3:
You can see that you need to know d past time series values to predict the next, when using this model. You may also notice that our expansions coefficients resemble pascals triangle:

This leads us to the following equation for d = 3:
There is an oscillation between positive and negative terms, which I account for here:
You can now see how this can be turned into a generalized formula, which I do here:
This is about as simple as time series integration can get without introducing new notation, which I’m going to do now. To make this a bit more readable and make writing out further time series equations more simple, I introduce the lag operator, L. This has the following properties:
So you can see how it works, and this means I can turn our rather lengthy expression for the integration into something more succinct:
So you can rearrange this for quite easily:
But we can also use our previous definition using a binomial expansion to make a solution that you can easily turn into an algorithm to solve it:
How do we fit an ARIMA(p, d, q) model?
Just like the previous models, to fit this model, we must first find what p, d, and q should be set to. Naturally, we find d first as the data must be stationary to fit the ARMA(p, q) model to it. This can be done by just continually differencing our data until it is stationary, which we can check using an Augmented Dickey-Fuller test.
Generating some ARIMA(p, d, q) data
Now to generate some data that has an ARIMA(p, d, q) model. For this, I have picked a value of d = 1, as this is a simpler thing to implement. And I will pick a value of p = 2, and q = 1. Now picking our two coefficients for the autoregressive part requires us to abide by the following constraint:
Where the roots of this must be under the following constraint:
I wrote the following python code to generate some potential values of and , which is below:
import numpy as np
beta_1s = np.linspace(0.1, 1.5, 15)
beta_2s = np.linspace(0.1, 1.5, 15)
c = -1.0
for i in range(len(beta_1s)):
for j in range(len(beta_2s)):
beta_1 = beta_1s[i]
beta_2 = beta_2s[j]
roots = np.roots([beta_2, beta_1, c])
all_outside_unit_circle = np.all(np.abs(roots) > 1.0)
if all_outside_unit_circle:
print(f"Found one with beta_1 = {round(beta_1, 2)} and beta_2 = {round(beta_2, 2)}")From this, I chose to pick the following values:
First, I create some ARMA(2,1) data using the following code:
import numpy as np
import matplotlib.pyplot as plt
N = 1000
np.random.seed(4)
norm = np.random.normal(0, 1, N)
x = np.linspace(0, N)
arma_data = np.zeros(N) + norm
beta_1 = 0.3
beta_2 = 0.5
theta_1 = -0.5
for i in range(2, N):
arma_data[i] = beta_1 * arma_data[i-1] + beta_2 * arma_data[i-2] + theta_1 * norm[i-1] + norm[i]
plt.plot(arma_data)
plt.title("ARMA(2,1) with $\\beta_1 = 0.3,\\beta_2 = 0.5$ and $\\theta_1 = -0.5$")
plt.xlabel("Timestep")
plt.show()Which produces the following plot:

Now to turn this into ARIMA(2,1,1) data, we must define an initial value, which I will set to 20 and then create the rest of the data from that. I do this with the following code:
data = np.zeros(N)
data[0] = 20
for i in range(1, len(data)):
data[i] = arma_data[i] + data[i-1]
plt.plot(data)
plt.title("ARIMA(2,1,1) with $\\beta_1 = 0.3,\\beta_2 = 0.5$ and $\\theta_1 = -0.5$")
plt.xlabel("Timestep")
plt.show()This produces the following plot:

Fitting to our data
To fit to our data, we will first find out what order of d to use. This can be done by using the following code:
from statsmodels.tsa.stattools import adfuller
def difference(data: np.ndarray, order: int):
if order == 0:
return data
differenced = np.zeros(len(data) - 1)
for i in range(0, len(data)-1):
differenced[i] = data[i+1] - data[i]
return difference(differenced, order-1)
differenced_data = data
for d in range(1, 5):
differenced_data = difference(data, d)
result = adfuller(differenced_data)
value = result[0]
one_percent_critical = result[-2]['1%'] # type: ignore
if value < one_percent_critical:
print(f"The order d = {d}")
breakThis obviously tells us to use the first order for d, as we’d expect. I now plot the autocorrelation and partial autocorrelation functions, to see if we can get any insight on what the orders for p and q might be by just looking at the data. This results in the following plots:


Which tells us a little about the data, such as the fact that p is most likely order of 2, but could be as high as 4. When there are more orders to the generated data, the PACF plot interpretation usually involves ignoring small values, even if they are larger than the 5% significance level. And the autocorrelation graph is seemingly all over the place, but seems to stop having any significance at a lag of 5. So we will have to fit this model in the same way the last one was fit, which is done with the following code:
from statsmodels.tsa.arima.model import ARIMA
train_data = data[:int(N*0.9)]
test_data = data[int(-N*0.1):]
min_bic = np.inf
min_fit = None
for p in range(1, 5):
for q in range(1, 6):
model = ARIMA(train_data, order=(p, 1, q))
model_fit = model.fit()
if model_fit.bic < min_bic:
min_bic = model_fit.bic
min_fit = model_fit
if min_fit != None:
print(min_fit.summary())This produces the following output:
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 900
Model: ARIMA(2, 1, 1) Log Likelihood -616.906
Date: Mon, 22 Dec 2025 AIC 1241.811
Time: 03:33:09 BIC 1261.016
Sample: 0 HQIC 1249.148
- 900
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.3235 0.049 6.584 0.000 0.227 0.420
ar.L2 0.4817 0.031 15.656 0.000 0.421 0.542
ma.L1 -0.5208 0.052 -9.998 0.000 -0.623 -0.419
sigma2 0.2308 0.011 20.253 0.000 0.208 0.253
===================================================================================
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 0.84
Prob(Q): 0.93 Prob(JB): 0.66
Heteroskedasticity (H): 0.87 Skew: 0.01
Prob(H) (two-sided): 0.24 Kurtosis: 2.85
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).You can see that the fit coefficients are quite close to the ones we used to generate data.
Predicting the test data values
To use our model to predict the values for the test data, I use the following code:
forecast_result = model_fit.get_forecast(steps=int(N*0.1))
forecast_values = forecast_result.predicted_mean
confidence_intervals = forecast_result.conf_int()
# Plotting the results
plt.figure(figsize=(10, 5))
plt.plot(test_data, label="Actual (Test Data)")
plt.plot(forecast_values, label="Forecast", color="red")
plt.fill_between(
range(len(test_data)),
confidence_intervals[:, 0],
confidence_intervals[:, 1],
color="pink",
alpha=0.3,
label="95% Confidence Interval"
)
plt.title("ARIMA(2,1,1) Forecast vs Actual")
plt.legend()
plt.show()This produces this plot:

Here you can see, that none of the values are outside out 95% confidence interval, and that unlike the other models that had an essentially constant error interval, this models error interval grows over time. This is to be expected, as this model allows for the mean to change with time, meaning we can do trend prediction with this model; as long as the data is homoscedastic.