Friday, October 17, 2008

Linear Correlations - A Love Story

When I was a dumb undergraduate, a sophomore I think, I decided I wanted to take a one-on-one course with a professor and do something experimental. He didn't really care what and sent me off on my own. I decided to try and measure the change of the index of refraction of glass with temperature.

One afternoon a week I went to a little room with a bunch of old equipment, including an oven and a spectrometer. I found some glass slides and a lamp and a thermometer and started to make measurements, heating up the glass, watching diffraction fringes change, etc. etc.

At the end I plotted my results, and they were all over the place. I mean, simply scattershot. I took out my trusted HP calculator (remember Reverse Polish Notation?) and plugged away and came up with a best-fit line. And when I took it to my professor (Bryon Dieterle), he pretty much laughed right in my face and told me you can't just draw a stupid line through a seemingly random set of results, and anyway, I hadn't considered the uncertainties of my data (or my result) and that was actually most of experimental science. I felt pretty humiliated -- I could go back and show you the exact spot in the hallway where he told me all this, almost like it was the Kennedy assassination. Then he explained uncertainties in terms of partial derivatives and it started to make sense. I think he gave me an A-.
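For anyone who hasn't seen it, the usual textbook version of "uncertainties in terms of partial derivatives" looks something like the formula below. This is the standard first-order propagation rule, not necessarily the way it was explained that day, and it assumes a result f computed from two independent measurements x and y with uncertainties \sigma_x and \sigma_y (symbols chosen here for illustration):

```latex
\sigma_f^2 \approx \left(\frac{\partial f}{\partial x}\right)^2 \sigma_x^2
                 + \left(\frac{\partial f}{\partial y}\right)^2 \sigma_y^2
```

Each term says how much a wobble in one measurement moves the final answer; if the measurements are correlated, an extra cross term is needed.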

I was always a lousy experimentalist -- I still panic at having to change a flat tire -- but in fact that humiliation taught me a great deal about what data analysis was all about and insisting on good data and precise data and not being stupid about it and all that. Anyway.

So when I see something like this

from the blog Stochastic Democracy, more or less endorsed by Matthew Yglesias here, I have to laugh. You can't just take a scattershot of points and draw a line through it because that's what your calculator (now Excel) tells you. Sure, you can, but it's meaningless -- it's a blob! -- and it's more important to understand that it's meaningless than to go through the nitty-gritty details of calculating slopes and intercepts and Pearson coefficients.

I don't know the moral of this story, except that you can do a lot of stupid things with the linear correlation function on your spreadsheet. Be sure to think first.
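To make the point concrete, here is a minimal sketch in Python (standard library only). The data are made up: x and y are drawn independently, so there is no real trend, yet the least-squares formulas will still hand back a slope and an intercept. The function name `fit_line` and the sample size are illustrative choices, not anything from the original graph.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r, slope_stderr), where r is the Pearson
    correlation coefficient and slope_stderr is the standard error of
    the fitted slope.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)

    # Scatter of the points about the fitted line, with n - 2 degrees of freedom.
    residual_ss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    slope_stderr = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, intercept, r, slope_stderr


# Made-up "blob": x and y are independent uniform random numbers.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(30)]
ys = [rng.uniform(0, 10) for _ in range(30)]

slope, intercept, r, slope_stderr = fit_line(xs, ys)
print(f"slope = {slope:.3f} +/- {slope_stderr:.3f}")
print(f"intercept = {intercept:.3f}")
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

The thing to look at is not the slope by itself but the slope next to its standard error, and r squared, which is the fraction of the scatter the line accounts for. If the error is comparable to the slope, the line is telling you very little.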


David said...

Fair point, but it is not a correlation graph. (Though I've taken the pictures down, since a lot of people seem to have thought it was.)

Based on the summary statistics I made available on the page, the marginal effect of population density on party identification is almost certainly positive (though small). The estimates might be a bit off, but I meant for this to be more of a quickie check than an expansive analysis.

But you are right, the graph was rather meaningless. But in my partial defense, I'm still working on how to present statistics without getting too technical, and I thought some pictures might help with accessibility.

All The Best,

David Shor

Dano said...

Yes. Thank you David.