Wednesday, November 21, 2012

More on Autocorrelation of Temperature Data

This is pretty wonkish and probably not of interest to anyone else, but... I have been trying to better understand autocorrelation in time series -- the notion that a data point, like a temperature, depends on the points just before it -- and especially how to calculate autocorrelation coefficients.

Here is a good review of autocorrelation in time series, if you're looking for it, by Dave Meko at the University of Arizona. This Appendix by Tom Wigley explains more, and his equation 9 is especially important.

I do most of my calculations in Excel, so I've been trying to figure out how to calculate the autocorrelation coefficients rk for a given lag k using Excel. (The "lag" k is the number of points back at which you're looking for a correlation -- temperatures in relation to last month is "lag 1," in relation to two months ago is "lag 2," and so on.)

Here is the magic formula:

rk = PEARSON(OFFSET($data$,0,0,N-k,1),OFFSET($data$,k,0,N-k,1))

where N is the number of data points in your time series, k is the lag, and $data$ is the array containing your data, such as A1:A180 for a 15-year series of monthly temperatures contained in column A. (After I figured this out, I then found it in this Excel forum discussion.)
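For anyone who prefers to check the formula outside Excel, here is a short sketch of the same lag-k calculation in Python. The function name and the example series are my own illustrations, not from the post:

```python
import numpy as np

def autocorr(x, k):
    # Pearson correlation of the series with a copy of itself shifted
    # by k points -- the same thing the PEARSON/OFFSET formula computes.
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-k], x[k:])[0, 1]

# Illustration on a made-up autocorrelated series (not real temperature data):
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=180))  # a random walk is highly autocorrelated
r1 = autocorr(series, 1)
```

np.corrcoef computes the same sample Pearson correlation as Excel's PEARSON, so the two approaches should agree to rounding.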

Once you know the autocorrelation coefficients, you can use them in the calculation of the uncertainty of a linear trend. In essence, the number of independent data points in the series is reduced -- usually drastically so. For first-order (lag-1) autocorrelation alone, the effective number of data points is (Meko eq. 15; Wigley eq. 9)
Neff = N(1-r1)/(1+r1)

This number can be much less than the number of actual data points. For example, for the monthly RSS lower troposphere temperature and a 15-year linear trend, the number of data points is 180 (= 12*15). But even just the lag-1 autocorrelation is so high (r1 = 0.746) that the effective number of independent degrees of freedom is, for the 15 years up to October 2012, only 26.3 -- just over two years' worth.
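As a sanity check, the effective-sample-size formula is easy to compute directly. This sketch uses the numbers quoted above; note that with r1 rounded to 0.746 the result comes out about 26.2, so the 26.3 above presumably comes from an unrounded r1:

```python
def n_eff(n, r1):
    # Effective number of independent points given lag-1 autocorrelation r1
    # (Meko eq. 15 / Wigley eq. 9).
    return n * (1 - r1) / (1 + r1)

n = 180      # 15 years of monthly data
r1 = 0.746   # lag-1 autocorrelation of the RSS series, as quoted above
print(n_eff(n, r1))  # about 26.2 with the rounded r1
```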

So you can begin to see why talking about trends over the last 15 years, or since 1998, or whatever, is really meaningless -- because of autocorrelation, you are really only talking about 2-3 years of independent data.

For higher autocorrelations -- and the typical time series you find in climate change usually contains significant autocorrelation beyond lag-1 -- the relationship is much more complicated, and I am still trying to figure it out. (Lee and Lund discuss it here; I found a copy somewhere, but it's pretty heavy mathematics that I'm still working through.) Most climatologists only consider lag-1 autocorrelation, as Foster and Rahmstorf did here, because it's much easier. But it's not the final word (nor, as Lee and Lund show, do higher autocorrelations necessarily mean more uncertainty about the trend, a somewhat counterintuitive result).

For example, for the 15-year trend of the RSS lower troposphere data, concluding in October 2012, I find:
r1 = 0.746
r2 = 0.648
r3 = 0.533
r4 = 0.421
r5 = 0.320
r6 = 0.264
r7 = 0.146
r8 = 0.300
r9 = -0.024

The correlation coefficients die off slowly.

Like I said, this is pretty far into the weeds, but I find it interesting, and maybe a few others will as well, including Google searchers who land here. I still have things to figure out; the basic question is, given a time series such as temperature anomalies, what is the statistical uncertainty in the trend (slope) including all relevant autocorrelation lags? (And what does "relevant" mean, exactly?)
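One concrete way to attack the lag-1 version of that question is the standard trick: fit the trend by ordinary least squares, estimate r1 from the residuals, and inflate the slope's standard error by sqrt(N/Neff). The sketch below runs on synthetic data and is only the lag-1 approximation discussed above; the function name and details are my own illustration, not Foster and Rahmstorf's exact procedure:

```python
import numpy as np

def trend_with_ar1_correction(y):
    # OLS trend plus a lag-1 autocorrelation correction to the slope's
    # standard error, via Neff = N(1-r1)/(1+r1) applied to the residuals.
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(n)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    neff = n * (1 - r1) / (1 + r1)
    s2 = np.sum(resid**2) / (n - 2)             # residual variance
    var_naive = s2 / np.sum((t - t.mean())**2)  # ordinary OLS slope variance
    se_adj = np.sqrt(var_naive * n / neff)      # inflate for lost degrees of freedom
    return slope, se_adj

# Illustration: a known trend of 0.01 per step plus AR(1) noise
rng = np.random.default_rng(1)
noise = np.zeros(180)
for i in range(1, 180):
    noise[i] = 0.7 * noise[i - 1] + rng.normal()
slope, se = trend_with_ar1_correction(0.01 * np.arange(180) + noise)
```

With autocorrelated noise the corrected standard error is several times the naive OLS one, which is exactly why short windows say so little.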

It's not an easy question, but it does show why considering "short" time intervals is meaningless. The question is, what does "short" mean? Mathematics is the only thing that can answer that.


Victor Venema said...

Hi David, you may want to search for the words: short range dependence / memory and long range dependence.

They are two different statistical models. The case you discuss most in your post is short range dependence, where the simplest case is an autoregressive (AR) process of order one, in which case you only need to consider the autocorrelation with the previous data point. The order of an AR process tells you how many lagged previous data points you are considering.
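To make this concrete: for an AR(1) process x[t] = phi*x[t-1] + noise, the theoretical lag-k autocorrelation is phi^k, so the coefficients decay geometrically. Here is a quick simulation sketch (phi set to the post's lag-1 value of 0.746; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
phi = 0.746                  # lag-1 autocorrelation from the post
n = 20000                    # long series so sample estimates are stable
x = np.zeros(n)
for t in range(1, n):        # simulate the AR(1) recursion
    x[t] = phi * x[t - 1] + rng.normal()

def autocorr(x, k):
    return np.corrcoef(x[:-k], x[k:])[0, 1]

for k in range(1, 6):
    # sample estimate alongside the theoretical value phi**k
    print(k, round(autocorr(x, k), 3), round(phi**k, 3))
```

Comparing with the r1 through r9 values listed in the post: the RSS coefficients decay more slowly than 0.746^k would, a hint that a pure AR(1) model understates the dependence.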

Long range dependence makes variability at long time scales much stronger. For the same autocorrelation with respect to the last point, or for the same correlation length, you will get much stronger variability at large time scales (scales much larger than the correlation length).

Maybe studying these things in R will make your life a lot easier. R is a free statistical programming language with many pre-programmed statistical functions.

Victor Venema said...

Sorry, my last reply was a bit too much science speak. This is a beautiful topic. Structure is an under-appreciated subject.

You can see the difference between long range dependence and short range dependence nicely in Figure 1 of a paper Henning Rust wrote with me. It shows two time series which have almost the same autocorrelations at short time lags, but which are very different with respect to their variability on large time scales.

Figure 2 of the same paper shows that this can make quite a difference for the uncertainty in the trends.

Because the autocorrelations at short lags are so similar, it is quite hard to distinguish between short range processes and long range processes, especially if you also allow for trends. You need a lot of data to do so.

If you do, you will definitely find that 15 years of data is not enough to say that the warming trend is over. Ideally you should also take into account that this 15-year period was chosen precisely because it does not show a trend. That gives you an additional multiple-testing problem that increases the uncertainties even more.

If climate scientists worked as sloppily as the "sceptics", the "sceptics" would complain.