Over at Upturned Earth, John Schwenkler has asked for an eighth-grade-level refresher course on what an r-squared value means.
Given how prone to misinterpretation the correlation coefficient r (and its square) is, it’s a little easier to talk about what it doesn’t mean. This is also a useful exercise in understanding the limits of science:
1) It doesn’t mean the probability that something causes something else.
Correlation does NOT imply causation.
2) It doesn’t even mean the probability that something and something else are correlated.
Things can be correlated in all kinds of ways. The r-squared value only speaks (in a weird way that we’ll discuss soon) to whether two things are linearly correlated. Once upon a time, physicists wreaked havoc upon the sciences by writing papers claiming all kinds of correlations that didn’t actually exist. It’s rather easy to ascribe correlations to things that are not, in fact, correlated. Don’t succumb to that temptation.
3) It doesn’t even mean the probability that something and something else are linearly correlated.
Statistics can’t actually tell you the probability of something being the case without additional assumptions. The oft-abused p-values are not, as most people interpret them, equivalent to one minus the probability that a given relationship exists. Rather, a p-value is the probability, assuming nothing but chance is at work, of observing a result at least as extreme as the one actually seen. This common misconception naturally extends to r-squared numbers: just consider Anscombe’s quartet, four data sets with nearly identical r-squared values and wildly different shapes.
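To make the Anscombe point concrete, here is a minimal sketch in plain Python that computes r-squared for each of the four data sets. The numbers are the standard published values of the quartet; the helper name `r_squared_of_correlation` is just an illustrative choice, not anything from a library.

```python
from math import sqrt

# Anscombe's quartet (standard published values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def r_squared_of_correlation(xs, ys):
    """Square of the Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)
    return r * r


results = {name: r_squared_of_correlation(xs, ys) for name, (xs, ys) in quartet.items()}
for name, value in results.items():
    print(f"Data set {name}: r-squared = {value:.3f}")
```

All four values come out at roughly two-thirds, even though a plot shows one noisy line, one smooth curve, one line with a single outlier, and one vertical stack of points dragged around by a single far-off observation. The number alone can’t tell them apart.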
So what on earth does it mean?
In as few words as possible, the r-squared value represents the fraction of the variability in a data set that can be accounted for by the statistical model (in a drearily frequentist way). As for what that actually means, statisticians aren’t really able to come to any agreement. Welcome to the wonderful world of Damned Lies.
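For the curious, the "fraction of the variability" idea can be sketched in a few lines of Python. This assumes the statistical model is an ordinary least-squares straight line, which is the usual case but not the only one; the function name `r_squared` and the sample numbers are made up for illustration.

```python
def r_squared(xs, ys):
    """1 - (leftover variability) / (total variability) for a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Fit the line y = intercept + slope * x by ordinary least squares.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Variability the line fails to account for (residual sum of squares).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variability of y around its own mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical, nearly linear measurements.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
example = r_squared(xs, ys)
print(f"r-squared = {example:.4f}")
```

A value near 1 says the line leaves very little of the scatter in y unexplained; a value near 0 says the line does no better than simply guessing the average of y every time.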