Saturday, September 01, 2012

On Claims of Data Manipulation

It's commonplace to hear claims, from people who don't like the results, that temperature data has been manipulated. As Raymond Pierrehumbert wrote in Slate, Paul Ryan made that claim in a 2009 op-ed:
"The CRU e-mail scandal reveals a perversion of the scientific method, where data were manipulated to support a predetermined conclusion. The e-mail scandal has not only forced the resignation of a number of discredited scientists, but it also marks a major step back on the need to preserve the integrity of the scientific community. While interests on both sides of the issue will debate the relevance of the manipulated or otherwise omitted data, these revelations undermine confidence in the scientific data driving the climate change debates."
Ryan, who in short order has demonstrated a truth-telling problem with even the small stuff, offers no evidence for such a claim. Fake skeptics like Steve Goddard claim it routinely, again, with no proof or evidence ever offered. It's scurrilous and extremely low.

But there are ways you can test for fraudulent data. One of the simplest is Benford's Law, which specifies the expected distribution of leading digits in many real-world datasets. It's particularly applicable to large datasets that span several orders of magnitude. As Wikipedia explains:
Benford's law, also called the first-digit law, states that in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 about 30% of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than 5% of the time. This distribution of first digits is the same as the widths of gridlines on the logarithmic scale. Benford's law also gives the expected distribution for digits beyond the first, which approach a uniform distribution as the digit place goes to the right.

This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude.

...There is a generalization of the law to numbers expressed in other bases (for example, base 16), and also a generalization to second digits and later digits.
Specifically, in a base "b" number system the leading digit "d" should occur with probability

P(d)=\log_{b}(d+1)-\log_{b}(d)=\log_{b} \left(1+\frac{1}{d}\right).

where to evaluate the base-b logarithm you can use logb(x) = ln(x)/ln(b). 
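
For anyone who wants to check the formula numerically, here is a minimal Python sketch that evaluates it for any base. The function name benford_probs is just an illustrative label, not something from a library.

```python
import math

def benford_probs(base):
    """Expected leading-digit probabilities P(d) for d = 1 .. base-1."""
    # Uses log_b(x) = ln(x) / ln(b), so P(d) = ln(1 + 1/d) / ln(b).
    return {d: math.log(1 + 1 / d) / math.log(base) for d in range(1, base)}

# Print the expected shares for base 3 and base 10.
for base in (3, 10):
    for d, p in benford_probs(base).items():
        print(f"base {base}, leading digit {d}: {p:.1%}")
```

The probabilities for a given base always sum to 1, since the logarithms telescope to log_b(b).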

A while back, after I heard about Benford's Law on Radiolab, I applied it to the monthly GISS global anomaly. I multiplied it by 100 to get an integer, and converted it to base 3 so the numbers spanned a few orders of magnitude. (In base 3, an order of magnitude is a factor of 3.) I then found the distribution of the leading digits 1 and 2:

incidence of leading digit being 1 = 62.0% 
incidence of leading digit being 2 = 38.0%

The theoretical values are P3(1)=63.1%, P3(2)=36.9%.
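
The tally itself does not require writing out the full base-3 representation: the leading digit in base b is whatever remains after repeatedly dividing by b. Below is a rough Python sketch of that idea, assuming the anomalies have already been multiplied by 100 and rounded to integers. The sample values are made up purely to show the mechanics and are not GISS data. Ignoring the sign and skipping zeros (which have no leading digit) are assumptions of this sketch, not choices stated in the post.

```python
from collections import Counter

def leading_digit(n, base):
    """Leading digit of the integer n written in the given base (sign ignored)."""
    n = abs(n)
    if n == 0:
        raise ValueError("zero has no leading digit")
    while n >= base:
        n //= base  # drop the lowest digit until one digit is left
    return n

def leading_digit_shares(values, base):
    """Fraction of the nonzero values having each leading digit 1 .. base-1."""
    counts = Counter(leading_digit(v, base) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, base)}

# Hypothetical integers standing in for anomaly * 100; NOT real GISS values.
sample = [-7, 5, 0, 12, 31, 46, -2, 58]
print(leading_digit_shares(sample, 3))
```

With the real monthly series in place of the sample list, the resulting shares can be compared against the theoretical base-3 values.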

To test a possible manipulation, I cooked up a simple warming trend: a pure linear trend of +0.01 C per month, so the data read 0.01, 0.02, 0.03, .... That gave

incidence of leading digit being 1 = 69.1% 
incidence of leading digit being 2 = 30.9% 

which are much further from the expected distribution, making them suspicious.
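
The linear-trend check can be reproduced in a few lines. After multiplying by 100 the series is just 1, 2, 3, ..., N, so the Python sketch below tallies base-3 leading digits for those integers. The series length is not stated above, so n_months is a free parameter here and the value used is only a placeholder; the exact percentages depend on it.

```python
def leading_digit(n, base):
    """Leading digit of the positive integer n written in the given base."""
    while n >= base:
        n //= base
    return n

def trend_shares(n_months, base=3):
    """Leading-digit shares for the integer series 1, 2, ..., n_months."""
    counts = {d: 0 for d in range(1, base)}
    for value in range(1, n_months + 1):
        counts[leading_digit(value, base)] += 1
    return {d: c / n_months for d, c in counts.items()}

n_months = 1500  # placeholder length, an assumption; not taken from the post
for d, share in trend_shares(n_months).items():
    print(f"incidence of leading digit being {d} = {share:.1%}")
```

One thing worth keeping in mind when reading the output: for a plain counting sequence the shares swing back and forth as N grows rather than settling on the Benford values, so the result depends on where the series happens to stop.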

I didn't get more sophisticated than this, because I didn't find an efficient way to convert to an arbitrary base using Excel, and I don't believe data is being manipulated anyway. It's all too consistent between groups, there's never been the slightest hint of any manipulation, and the people making the accusations have never offered any proof or evidence and are usually so dishonest about everything else that I didn't see the point in going further. This is more proof than they've ever given.

So, for whatever it's worth, here it is: a simple test that detects no fraud.

21 comments:

charlesH said...

Are you using RAW GISS data or the adjusted GISS data?

Jon said...

Interesting - I hadn't seen that particular excerpt of what Ryan wrote before. Someone should ask him to name these "discredited scientists" who have supposedly been forced to resign because of the emails.

David Appell said...

What do you mean by "raw GISS data?"

Papa Zu said...

@ "Fake skeptics"

When you say fake sceptics the first person that comes to mind is the cartoonist over at the Skeptical Science blog.

Fake skeptics aren't as bad a problem as "fake scientists" who are willing to pressure journal editors in order to prevent some other scientist's work from being published in that journal.

David Appell said...

What scientist(s) have pressured an editor, and which editor?

charlesH said...

David,

You don't know the difference between raw and adjusted temp data? You need to get out more.

Dano said...

You need to get out more.

You need to be more intellectually honest. What is in it for you to be so dishonest?

Best,

D

David Appell said...

Of course I know the difference. But GISS doesn't collect any data -- they use that collected by others, mostly that in the GHCN. GISS *is* the adjusted data, adjusting for gaps in space and time, the UHI, geographical weighting, and other necessary factors to create meaningful, long-term global and hemispheric averages.

charlesH said...

1) I think your method is just silly. Somehow you believe it will detect warming bias in the data.

2) This test is just silly. As if you believe anyone believes this kind of manipulation is the problem.

"To test a possible manipulation, I took cooked up a simple warming trend: a pure linear trend of +0.01 C per month, so the data read 0.01, 0.02, 0.03, ...."

Why don't you start with what others demonstrate to be the old data compared to the new data and explain why the adjustments have been made?

charlesH said...

David,

Please explain the 1999 version and the 2012 version of US temp trends.


http://stevengoddard.wordpress.com/data-tampering-at-ushcngiss/

David Appell said...

Charles: What research into the scientific literature have you done to answer your own question?

PS: I do not trust a single thing Steve Goddard says. I read his blog sometimes and find it riddled with errors, and his "analysis" to sometimes be outright dishonest.

Dano said...

That is a great example: using Steven Soddard as a source is intellectually dishonest.

That's how you do it.

Best,

D

David Appell said...

Re: adjustments

Charles, how about first doing some research of your own before making insinuations:

GISS Surface Temperature Analysis
Updates to Analysis
http://data.giss.nasa.gov/gistemp/updates_v3/

charlesH said...

First.

Do you accept the 1999 version and 2012 version as presented correctly?

Do you take issue with any of the analysis presented? If so, what?



David Appell said...

Charles, what research into the scientific literature have you done to answer your own question?

No, I don't trust whatever numbers Steve Goddard says are the data. I don't trust a single thing he says -- I've seen too much carelessness and pure dishonesty from him. I'm not starting on a wild goose chase based on some graph such a blogger threw up.

Really, you need to raise your standards.

charlesH said...

My research includes asking "warmers" if they take issue with what he presents in the post in question.

Since you didn't point out any problems I will have to look elsewhere.

The longer I go asking the question without a substantive response the more I think his data is correct.

If I ask a "skeptic" if they take any issue with Mann's hockey stick I get a lot of references to CA.

I think it is a quite effective way to hear both sides and make a decision as to whom to believe.

David Appell said...

So you don't even bother to look in the scientific literature? You just assume Goddard is right, unless proven wrong....

That's a pretty lousy way of doing research, Charles (though I see how it saves you a lot of work and thinking). You need to scrape your standards off the floor and aim higher.

charlesH said...

1) Goddard is just posting official data for 1999 and 2012. Either his data is posted correctly or it isn't.

Since no one contests his posted data I assume it's correct.

2) Next, why have the adjustments been made?

Paul S said...

charlesH,

GISS use the USHCN dataset for CONUS temperature data. Since 1999 USHCN have substantially reviewed and revised their procedures for dealing with record discontinuities such as TOBS, station siting changes etc., hence the GISS CONUS result is different. Note that BEST's CONUS reconstruction has been produced independently of the USHCN adjustments and gives substantially similar results.

/Can you guess the soup of the day?

Dano said...

I _knew_ that after ClimateGate broke that the glaciers would start advancing again, the oceans recede, the rain patterns return, droughts lessen, plants move south, animals move downhill, whales return to old migration patterns, fish swim south, Arctic ice recover, and so on.

This Goddard guy has just proven that the GLOBUL KINSPEERCY of climategist...climologists...clim...scaaaaaaaaaaaaaaaaaaaan-tists were done movin them glayshers an' pumpin up the ocean so they'd get them some ree-serch munny!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

chuckle

Best,

D

charlesH said...

Paul,

"GISS use the USHCN dataset for CONUS temperature data. Since 1999 USHCN have substantially reviewed and revised their procedures for dealing with record discontinuities such as TOBS, station siting changes etc., hence the GISS CONUS result is different. Note that BEST's CONUS reconstruction has been produced independently of the USHCN adjustments and gives substantially similar results."

Thank you for the explanation. Sounds like we are back to the UHI/siting issues that have been in the blogosphere recently.