What is lag in a time series?

I am curious about what a lagged time series is. On Investopedia, I saw an article that said: "Autocorrelation is degree of similarity between time series and a lagged version of itself over successive intervals." Could someone please explain what "lagged" means, and why autocorrelation matters in time series analysis? Does autocorrelation mean that the time series will behave as it did in the past?

Thanks!

Edit: Thanks for everyone's answers, especially the 2 thumbs-up answer earlier. That was very helpful.

Now I am wondering why autocorrelation even matters. Sure, a function may correlate with a shifted version of itself, but who says the function will keep behaving that way? Is it just through correlation? Why does this matter in the context of autoregressive models, and how did we get from autocorrelation to ARMA/ARIMA-type models for time series in the first place? Who developed time series analysis?


Solution 1:

Lag is essentially delay. Just as correlation shows how similar two time series are, autocorrelation describes how similar a time series is to a delayed copy of itself.

Consider a discrete sequence of values. For lag 1, you compare your time series with a lagged copy of itself; in other words, you shift the time series by 1 before comparing it with the original. Repeat this for the entire length of the time series, shifting by 1 more each time. The resulting correlations, one per lag, form the autocorrelation function.
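The shifting procedure can be sketched in a few lines of Python. This is only a minimal illustration, not a library implementation: the helper names `pearson` and `lagged_correlation` and the toy series are made up for the example, and each lag is computed as an ordinary correlation between the overlapping parts of the series and its shifted copy.

```python
from math import sqrt

def pearson(a, b):
    """Ordinary Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

def lagged_correlation(series, lag):
    """Correlate the series with a copy of itself shifted by `lag` steps."""
    original = series[:len(series) - lag]  # drop the last `lag` values
    shifted = series[lag:]                 # drop the first `lag` values
    return pearson(original, shifted)

# Toy series (hypothetical values): a slow rise followed by a slow fall.
series = [1, 2, 4, 7, 9, 10, 9, 7, 4, 2, 1, 2, 4, 7, 9, 10]

# One correlation per lag: together these form the autocorrelation function.
acf = [lagged_correlation(series, lag) for lag in range(6)]
for lag, value in enumerate(acf):
    print(f"lag {lag}: {value:+.3f}")
```

Statistical packages usually use a slightly different formula (one mean and one variance for the whole series), so their numbers can differ a little from this direct version.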

From the values of the autocorrelation function, you can see how much the series correlates with itself. For any time series you will have perfect correlation at lag/delay = 0, since you're comparing the same values with each other. As you shift your time series you begin to see the correlation values decreasing. Note that if the time series consists of completely random values, you will only have correlation at lag = 0 and essentially none at any other lag. Most datasets/time series are not like this: the correlation tends to fall off gradually, so there is still some correlation at low lag values.
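To see the difference between completely random values and a series whose values depend on their recent past, here is a small Python sketch. Everything in it is an assumption chosen for illustration: the seed, the series length, and the "keep 90% of the previous value" rule used to build the series with memory.

```python
import random
from math import sqrt

def lagged_correlation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag`."""
    a = series[:len(series) - lag]
    b = series[lag:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

random.seed(0)  # fixed seed so the run is repeatable
n = 5000

# Completely random values: each point is independent of all the others.
noise = [random.gauss(0, 1) for _ in range(n)]

# A series with memory: each value keeps 90% of the previous one plus new noise.
# (The 0.9 is an arbitrary choice for illustration.)
smooth = [0.0]
for _ in range(n - 1):
    smooth.append(0.9 * smooth[-1] + random.gauss(0, 1))

for lag in range(5):
    print(f"lag {lag}: noise {lagged_correlation(noise, lag):+.3f}   "
          f"smooth {lagged_correlation(smooth, lag):+.3f}")
```

The noise column should sit near zero for every lag except 0, while the second column should fall off gradually.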

Now consider a long periodic time series, for example outdoor temperature over a few years, sampled hourly. Your time series will correlate with itself on a daily basis (day/night temperature drop) as well as a yearly one (summer/winter temperatures). Let's say your first data point is at 1 pm in midsummer. Lag = 1 represents one hour. The autocorrelation function at lag = 1 will show a slight decrease in correlation. At lag = 12 you will have the lowest correlation of the day, after which it will begin to increase. Move forward 6 months, to 1 pm. Your time series is still somewhat correlated. Move the lag to 6 months and 1 am. This might be the lowest correlation point in the time series. At a lag of 12 months your time series is again close to the peak value.
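The daily part of this picture can be reproduced with a made-up signal. The sketch below assumes an idealized "temperature" that is a pure 24-hour sine wave (no yearly cycle, no noise), so the numbers come out cleaner than they would for real data.

```python
from math import sin, pi, sqrt

def lagged_correlation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag`."""
    a = series[:len(series) - lag]
    b = series[lag:]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical hourly "temperature": a pure 24-hour cycle over 30 days.
hours = 24 * 30
temperature = [20 + 5 * sin(2 * pi * t / 24) for t in range(hours)]

# Correlation at a quarter day, half day, three quarters of a day, and a full day.
for lag in (0, 6, 12, 18, 24):
    print(f"lag {lag:2d} h: {lagged_correlation(temperature, lag):+.3f}")
```

The correlation should bottom out at the 12-hour lag and return to its peak at 24 hours, which is the daily pattern described above.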

You might have noticed from the previous example that the autocorrelation function reveals the frequency components of a time series. Indeed, it is closely tied to the frequency domain: it is just a Fourier transform away from being a power spectrum.
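The link to the power spectrum (the Wiener-Khinchin relation) can be checked numerically. The sketch below is one way to do it under stated assumptions: it uses a made-up two-sine signal, a naive Fourier transform written out by hand, and a circular (wrap-around) autocorrelation without mean removal or normalization, which is the version for which the identity is exact on a finite sample.

```python
import cmath
from math import sin, pi

def dft(values):
    """Naive discrete Fourier transform (O(n^2)), fine for a short demo."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * pi * f * t / n) for t, v in enumerate(values))
            for f in range(n)]

# Hypothetical signal: two sine waves with periods of 8 and 4 samples.
n = 32
x = [sin(2 * pi * t / 8) + 0.5 * sin(2 * pi * t / 4) for t in range(n)]

# Circular autocorrelation: shift the series by `lag`, wrapping around the end.
autocorr = [sum(x[t] * x[(t + lag) % n] for t in range(n)) for lag in range(n)]

# Power spectrum computed two ways: directly, and via the autocorrelation.
power_from_signal = [abs(c) ** 2 for c in dft(x)]
power_from_autocorr = [c.real for c in dft(autocorr)]

largest_gap = max(abs(p - q) for p, q in zip(power_from_signal, power_from_autocorr))
print(f"largest difference between the two spectra: {largest_gap:.2e}")
```

The two spectra should agree up to rounding error, with their peaks at the frequencies of the two sine waves.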

For a random time series, the autocorrelation function shows how quickly the series becomes dissimilar to itself, while for a periodic time series it shows at which delay/lag values the series is similar to itself.

Hope this isn't as confusing as it seems.

Solution 2:

I will illustrate with some geological data known to have interesting autocorrelation. In the summer of 1987, rangers at Yellowstone National Park measured times between eruptions of Old Faithful Geyser. This geyser is well known for its relatively regular eruptions, but it is not a clock. One goal in collecting these data was to find a way to predict the time of the next eruption for the convenience of tourists waiting to see an eruption.

Data (in minutes) for $n = 107$ (almost) consecutive waiting times are as follows:

x = c(78, 74, 68, 76, 80, 84, 50, 93, 55, 76, 58, 74, 75, 80, 56, 80, 69, 57,
      90, 42, 91, 51, 79, 53, 82, 51, 76, 82, 84, 53, 86, 51, 85, 45, 88, 51,
      80, 49, 82, 75, 73, 67, 68, 86, 72, 75, 75, 66, 84, 70, 79, 60, 86, 71,
      67, 81, 76, 83, 76, 55, 73, 56, 83, 57, 71, 72, 77, 55, 75, 73, 70, 83,
      50, 95, 51, 82, 54, 83, 51, 80, 78, 81, 53, 89, 44, 78, 61, 73, 75, 73,
      76, 55, 86, 48, 77, 73, 70, 88, 75, 83, 61, 78, 61, 81, 51, 80, 79)

In order to see if the wait for the last eruption is useful in predicting the wait for the next, one can consider the correlation between the vector $x_1 = (78, 74, 68, \dots, 51, 80)$ and the vector $x_2 = (74, 68, \dots, 80, 79),$ which is 'lagged' by one eruption.

The correlation is $r_{1,2} = -0.685,$ indicating that short waits tend to be followed by long ones, as shown in the plot below. The calculation of the autocorrelation in R statistical software is:

x.1 = x[1:106]   # waits 1 to 106
x.2 = x[2:107]   # waits 2 to 107: the same series lagged by one eruption
cor(x.1, x.2)
## -0.6849171

[Figure: plot of consecutive waiting times, showing short waits followed by long ones]

The autocorrelation function (ACF) shows correlations for several lags (of order 2, 3, 4, etc.) in addition to the lag of order 1 just illustrated.

[Figure: ACF plot of the waiting times, with dotted blue significance band]

The first few lags show alternating negative and positive autocorrelations. Autocorrelations that fall within the band marked by the dotted blue lines are deemed not to be significantly different from $0.$ (Of course, the autocorrelation for 'lag 0' is just the correlation $r = 1$ of $x$ with itself.)
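The answer does not say how the blue band is computed. A common convention, and to my knowledge the default in R's `acf` plot, is the 95% limit for a series with no autocorrelation, $\pm 1.96/\sqrt{n}$. Assuming that convention, the band for this series can be sketched in Python as:

```python
from math import sqrt
from statistics import NormalDist

n = 107  # number of waiting times in the series

# Assumption: the band is the usual 95% limit for white noise, +/- z / sqrt(n).
z = NormalDist().inv_cdf(0.975)  # about 1.96
band = z / sqrt(n)

# Lag-1 correlation reported earlier; the value on an ACF plot may differ slightly
# because the ACF uses the whole-series mean and variance.
lag1 = -0.685

print(f"band: +/-{band:.3f}")
print("lag 1 outside the band:", abs(lag1) > band)
```

Under this assumption, the lag-1 autocorrelation lies far outside the band, which fits the description of the plot.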

While it is impossible to say whether a time series (economic or geological) will continue past behavior into the future, autocorrelation methods have proved effective in some kinds of short-range forecasting.

In the Old Faithful example, the length of the wait for the last eruption did provide a useful guide to predicting the wait for the next eruption. In fact, these inter-eruption times form a Markov chain in which several past eruptions still provide useful predictive information (until the 'one-step' dependence of the Markov chain 'wears off').
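As a rough illustration of how the last wait can guide a prediction of the next one, here is a Python sketch that fits a straight line to the pairs (previous wait, next wait) from the data above. The straight-line fit is a simplification chosen for this example; the answer does not say what prediction method was actually used at the park, and `predict_next` is a made-up helper name.

```python
# Waiting times in minutes, copied from the data listed above.
x = [78, 74, 68, 76, 80, 84, 50, 93, 55, 76, 58, 74, 75, 80, 56, 80, 69, 57,
     90, 42, 91, 51, 79, 53, 82, 51, 76, 82, 84, 53, 86, 51, 85, 45, 88, 51,
     80, 49, 82, 75, 73, 67, 68, 86, 72, 75, 75, 66, 84, 70, 79, 60, 86, 71,
     67, 81, 76, 83, 76, 55, 73, 56, 83, 57, 71, 72, 77, 55, 75, 73, 70, 83,
     50, 95, 51, 82, 54, 83, 51, 80, 78, 81, 53, 89, 44, 78, 61, 73, 75, 73,
     76, 55, 86, 48, 77, 73, 70, 88, 75, 83, 61, 78, 61, 81, 51, 80, 79]

prev_wait = x[:-1]  # waits 1..106
next_wait = x[1:]   # waits 2..107, i.e. lagged by one eruption

# Ordinary least-squares line: next_wait ~ intercept + slope * prev_wait.
n = len(prev_wait)
mean_prev = sum(prev_wait) / n
mean_next = sum(next_wait) / n
sxy = sum((p - mean_prev) * (q - mean_next) for p, q in zip(prev_wait, next_wait))
sxx = sum((p - mean_prev) ** 2 for p in prev_wait)
slope = sxy / sxx
intercept = mean_next - slope * mean_prev

def predict_next(wait):
    """Predicted next waiting time (minutes) from the last one (hypothetical helper)."""
    return intercept + slope * wait

for wait in (50, 70, 90):
    print(f"after a {wait}-minute wait, predict about {predict_next(wait):.0f} minutes")
```

Because the lag-1 correlation is negative, the fitted slope is negative: a short wait leads to a longer predicted wait, and vice versa.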

Notes: (1) The usual definition of correlation $$r_{xy} = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{(n-1)S_X S_Y},$$ where $\bar X$ and $\bar Y$ are sample means and $S_X$ and $S_Y$ are sample standard deviations, is somewhat modified for autocorrelations because $X$ and $Y$ are the same series, except for the lag. The modifications are that the sample mean and SD of the whole series are used, and the sum is taken over $n - \ell$ terms, where $\ell$ is the order of the lag.

(2) I said that the eruptions in the series $x$ are 'almost' consecutive. In fact, there are a few gaps where nighttime eruptions are missing, but they do not interfere with the fundamental story about autocorrelation.

(3) Since these data were collected, there have been several earthquakes near Old Faithful geyser, which have rearranged the underground 'plumbing' for hot water feeding the geyser. So recent data on inter-eruption times are slightly different. Several websites undertake to provide current data.

(4) You may find the Wikipedia article on autocorrelation useful. (But the notation is not the same as I am used to.)
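The modified formula in Note (1) can be sketched in Python as follows. This is one reading of that note, written with the common convention that the numerator sums $n - \ell$ products of deviations from the whole-series mean and the denominator is the sum of squared deviations over the whole series; the function name `sample_acf` and the toy series are made up for illustration.

```python
def sample_acf(series, max_lag):
    """Autocorrelations for lags 0..max_lag using whole-series mean and variance."""
    n = len(series)
    mean = sum(series) / n
    deviations = [v - mean for v in series]
    total = sum(d * d for d in deviations)  # whole-series sum of squares
    return [sum(deviations[t] * deviations[t + lag] for t in range(n - lag)) / total
            for lag in range(max_lag + 1)]

# Toy series (hypothetical values) that alternates high and low, loosely like the geyser.
series = [78, 52, 80, 55, 76, 50, 84, 53, 79, 51, 82, 56]
toy_acf = sample_acf(series, 4)
for lag, value in enumerate(toy_acf):
    print(f"lag {lag}: {value:+.3f}")
```

For an alternating series like this one, the signs should alternate as well: negative at odd lags and positive at even lags.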