Forecasting

Students cheer on the Redhawks during a sporting event at Miami University.

Making predictions based on current trends and/or past and present data can be useful. The best way to do this is by using forecasting tools. R has many packages and tools that can be installed to do this kind of work. There are also some tools already in the base package of R.

Note: This page will cover forecasting tools from the base stats package of R. However, sometimes it can be more efficient and easier to customize the tools from other time series packages in R such as forecast, TSA, and tseries. To learn more about packages please visit our Installing and Loading Packages page.

Example of Time Series Data

Data is a time series when it corresponds with certain intervals of time. If data is not expressed as a time series to begin with, the ts() command can be used to turn any dataset into a time series dataset:

ts(data, start = 1, end = numeric(), frequency = 1)

Arguments:

  • data: the vector or matrix you want to turn into a time series object (commonly referred to simply as a ts object)
  • start: the starting time point of the data (default is 1)
  • end: the end time point of the data
  • frequency: how to split the data into time points (1 means yearly, 4 means quarterly, etc.)

Example:

timeVector <- c(20, 30, 23, 55, 86, 93, 42, 63, 67, 91)
myts <- ts(timeVector, start=c(2000), end=c(2010), frequency=1)
plot(myts, main = "myts Against Time")

Line graph connecting the time series data points given in the prior chuck of code.

Note: If your data is already a ts object, only the plot() function is necessary.

Useful Functions when Working with Time Series Data

is.ts()

This function is used to check if something is a ts object. The function will either produce TRUE  (if it is a time series) or FALSE (if it is not). It is important to know how to check if something is a time series object because many functions used for time series data require time series objects in their arguments.

Example:

is.ts(timeVector)
## [1] FALSE
is.ts(myts)
## [1] TRUE

Most of the examples on the remainder of this page will use the "nottem" dataset, which is already included in your base installation of R. This dataset contains the monthly temperatures in Nottingham from 1920 to 1939. To get this dataset, run the following line of code:

data("nottem")

length()

This function prints how many observations are in a ts object.

Example:

length(nottem)
## [1] 240

head() and tail()

When a time series is very long, you may only want to look at parts of it. The head() function lets us look at a specified number of the beginning elements and the tail() function lets us look at the elements at the end.

Examples:

head(nottem, 10)
## [1] 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5
tail(nottem, 10)
## [1] 42.4 47.8 52.4 58.0 60.7 61.8 58.2 46.7 46.6 37.8

print()

This function allows the user to view the dataset by printing its contents.

Example:

print(nottem)

deltat()

This function gives the difference in the time between each value and the next one.

Example:

deltat(nottem)
## [1] 0.08333333

frequency()

This function gives the number of observations per one unit of time.

Example:

frequency(nottem)
## [1] 12

cumsum()

This function calculates the cumulative sum of the argument, returned as a vector.

Example:

cumsum(nottem)
## [1] 20 50 73 128 214 307 349 512 479 570

cor()

This function calculates the correlation between two vectors or matrices.

Example:

myts2 = rep(3, 240)
cor(nottem, myts2)
## warning in cor(nottem, myts2): the standard deviation is zero
## [1] NA

Note: Both vectors must be of the same length. In this example, since myts2 is a random vector with just the same value over and over again, it doesn't tell us anything of value. Just know this is how it works.

cov()

This function calculates the covariance between two vectors or matrices.

Example:

cov(nottem, myts2)
## [1] 0

Note: Both vectors have to be of the same length.

Transformations of Data

Often, there may be patterns in the data, such as the variance increasing at a steady rate. In a case like this, it might be beneficial to transform the data in a way that does not change how the data is related, but takes out the patterns in the data.

log()

This transformation will cause each data point in a dataset to have the log of it computed. This can help negate increasing variance.

Example: (only the first 5 rows of output are shown)

 log(nottem)
Jan Feb Mar Apr May Jun Jul Aug Sep
1920 3.703768 3.708682 3.793239 3.843744 3.990834 4.069027 4.055257 4.032469 3.994524
1921 3.788725 3.683867 3.808882 3.850148 3.990834 4.072440 4.194190 4.092677 4.043051
1922 3.624341 3.655840 3.676301 3.740048 4.019980 4.056989 4.039536 3.994524 3.994524
1923 3.732896 3.691376 3.758872 3.824284 3.895894 3.964615 4.162003 4.087656 3.996364
1924 3.671225 3.624341 3.645450 3.817712 3.974058 4.055257 4.107590 4.063885 4.032469

diff()

This transformation may be used when it is desired to remove the long term time trend. It can also be used to remove seasonal trends.

Example: (only the first 5 rows of output are shown)

 diff(nottem)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1920 0.2 3.6 2.3 7.4 4.4 -0.8 -1.3 -2.1 -3.8 -7.6 -3.1
## 1921 4.4 -4.4 5.3 1.9 7.1 4.6 7.6 -6.4 -2.9 -2.8 -14.5 3.1
## 1922 -5.3 1.2 0.8 2.6 13.6 2.1 -1.0 -2.5 0.0 -7.2 -5.3 -0.1
## 1923 0.1 -1.7 2.8 2.9 3.4 3.5 11.5 -4.6 -5.2 -5.2 -12.9 1.3
## 1924 1.7 -1.8 0.8 7.2 7.7 4.5 3.1 -2.6 -1.8 -6.6 -5.4 -0.8
diff(nottem, lag = 12) ## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1921 3.6 -1.0 0.7 0.3 0.0 0.2 8.6 3.5 2.7 3.7 -3.2 3.0
## 1922 -6.7 -1.1 -5.6 -4.9 1.6 -0.9 -9.5 -5.6 -2.7 -7.1 2.1 -1.1
## 1923 4.3 1.4 3.4 3.7 -6.5 -5.1 7.4 5.3 0.1 2.1 -5.5 -4.1
## 1924 -2.5 -2.6 -4.6 -0.3 4.0 5.0 -3.4 -1.4 2.0 0.6 8.1 6.0

Note: The lag parameter specifies what lag the transformation should be made at. For each lag, one observation will be lost (example: 12 lags will cause the first 12 observations to not have points).

Other Useful Time Series Functions

start()

This function gives the starting time point of the first value in a dataset, if the set is a time series.

Example:

start(nottem)
## [1] 1920 1

end()

This function gives the ending time point of the last value in a dataset again assuming the set is a time series.

Example:

end(nottem)
## [1] 1939 12

time()

This function outputs a matrix of values that tells the user at what time each observation is recorded. It is calculated by taking the number of observations minus one all divided by the total number of observations. This assumes the dataset is a time series.

Example: (only the first 5 rows are shown)

 ime(nottem)
## Jan Feb Mar Apr May Jun Jul
## 1920 1920.000 1920.083 1920.167 1920.250 1920.333 1920.417 1920.500
## 1921 1921.000 1921.083 1921.167 1921.250 1921.333 1921.417 1921.500
## 1922 1922.000 1922.083 1922.167 1922.250 1922.333 1922.417 1922.500
## 1923 1923.000 1923.083 1923.167 1923.250 1923.333 1923.417 1923.500
## 1924 1924.000 1924.083 1924.167 1924.250 1924.333 1924.417 1924.500
## Aug Sep Oct Nov Dec
## 1920 1920.583 1920.667 1920.750 1920.833 1920.917
## 1921 1921.583 1921.667 1921.750 1921.833 1921.917
## 1922 1922.583 1922.667 1922.750 1922.833 1922.917
## 1923 1923.583 1923.667 1923.750 1923.833 1923.917
## 1924 1924.583 1924.667 1924.750 1924.833 1924.917

cycle()

This gives the number of the observations in each cycle, assuming the dataset is a time series.

Example: (only the first 5 rows are shown)

 cycle(nottem)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1920 1 2 3 4 5 6 7 8 9 10 11 12
## 1921 1 2 3 4 5 6 7 8 9 10 11 12
## 1922 1 2 3 4 5 6 7 8 9 10 11 12
## 1923 1 2 3 4 5 6 7 8 9 10 11 12
## 1924 1 2 3 4 5 6 7 8 9 10 11 12

stl()

This is used to separate a time series into seasonal, trend and irregular components. These can then be plotted to view each of the individual components.

  stl(x, s.window)


Argument meanings:

  • x: the time series to be decomposed
  • s.window: sets how quickly the seasonality evolves

Example:

nottemstl = stl(nottem, s.window = "periodic")
plot(nottemstl)
dev.copy(png, 'stl.png')
dev.off()

Four line plots are stacked on top of each other in a vertical layout with all sharing the same horizontal axis depicting the year. Starting with the topmost plot, the vertical axes correspond to the original data values, the seasonal, the trend, and remainder (or error) component. The second plot clearly shows an up/down pattern of the seasonal component, the third plot shows no signs of there being a trend component since its corresponding plot appears to randomly move up and down, while the bottom plot shows no pattern in the error term with a few possible outliers.

acf() and pacf()

acf stands for "autocorrelation function". It plots the estimates of the covariance. pacf stands for "partial autocorrelation function." Both show a plot by default.

Example:

acf(nottem)

The ACF plot depicts the ACF value on the vertical axis and number of lags on the horizontal axis. The plot shows clear signs of autocorrelation present by there being a repetitive series of spiking ACF values alternating above and below zero and well beyond the horizontal reference lines for acceptable limits.

pacf(nottem)
Partial ACF plot shows lag value on the horizontal axis and PACF value on the vertical axis. A clear positive maximum value of 0.8 begins the series followed by seven successive negative PACF values, many of which are well beyond the limiting threshold of around negative 0.15. The series becomes positive up until lag 1 and then trails off to zero thereafter.

arima()

"arima" stands for autoregressive integrated moving average. With this function, a number of models can be created with the mean and intercept output.

arima(x, order = c(0L, 0L, 0L))


Argument meanings:
  • x: is a univariate time series
  • order: the order of the model to be fit (with the first number corresponding to AR, the second to I, which is the differencing level, and the third to MA).

arima.sim()

This function is used to create a simulation of an arima model.

Argument meanings:
  • model: the kind of order to simulate
  • n: the number of observations

There are a number of different kinds of models that can be created: ARIMA, ARMA, MA, and AR. The AR part in arima is for the variable of interest in the dataset being regressed on its own lagged values. The MA part in arima is for the regression error terms and indicates that they are a linear combination of error terms. If the middle value of the order = c(0L, 0L, 0L) is zero with the first and third being numbers, this is an ARMA model. ARIMA has values for all three.

AR = order = c(p, 0, 0)
MA = order = c(0, 0, q)
ARMA= order = c(p, 0, q)
ARIMA = order = c(p, r, q)

Example:

fit = arima(nottem, order = c(1, 0, 0), list(order = c(2, 10, 0), period = 12))
acf(fit$resid)

The ACF plot for the more complex model shows the expected spike of 1 initially and then randomly fluctuates within the threshold limits of around positive and negative 0.15 thereafter.

The acf for the residuals seems to be good, so this model is acceptable.

AIC() and BIC()

AIC and BIC are information criterion measures that are used to help determine which model is the best from the ones that have been fit, like the ARIMA models we have been looking at. The best model will have lower values for AIC and BIC; however, the values have no directly interperatable meaning. That is, the values are only comparable to themselves and have no meaningful units. A general rule of thumb is that a change in AIC or BIC of less than 6 or 7 is likely not that meaningful.

Examples:

 AIC(fit)
## [1] 1061.185
BIC(fit)
## [1] 1074.902

Need a Refresher?

Go back to the beginner tutorials.