It sometimes happens that we have only limited access to direct measurements of some variable of interest, but we have abundant data on some other, related variable. In such cases we can use the “other” variable as a *proxy* for our target variable. We attempt to determine the relationship between them, then use the measurement of one as input to that relationship in order to estimate the other. Voila! Of course such indirect estimates will be imperfect, but at least they’re an approximation, we hope a useful one. Why, such practice has even been applied to climate science.

Let be the predictor variable (the “proxy” variable), for which we have lots of data, and let be the “target” variable. Suppose futher that there’s a strict linear relationship between them

.

In such a case, if we know then we can determine exactly, **as long as** we know the two defining parameters and . To determine those, all we need is at least two cases for which we know both and . Then the proxy is a perfect proxy for the target , which we can determine without error. How wonderful! It’s also unusual, or as they say in these parts, “T’aint likely.”

A much more common case is that we can’t directly observe the target variable without some kind of noise entering into the mix; what we actually observe is which includes some random variation

.

We expect to be a random variable, and usually expect it to have zero mean. It might be measurement error, it might be a random physical process which contributes to , but it’s in addition to the definitive value which is a simple consequence of the influence of . We now have the model

,

where is random. We often even assume that the random term follows a normal distribution with zero mean and unknown standard deviation. We can still estimate the parameters and from samples of and , by regression (usually least-squares regression). The least-squares regression also enables us to estimate the standard deviation in (which we’ll call ) as well.

Knowing all those parameters, if we have a value for for which we don’t know the corresponding values of and , we can estimate them using our model. We can use as a proxy for , but it’s no longer a perfect proxy — not because the relation between and is imperfect, but because our estimates of the parameters and are imperfect (since they’re based on noisy data). But at least the estimated values of and are *unbiased*, i.e., we expect to get the correct result, the imperfection is only random and doesn’t preferentially lean one way or the other.

We can also use as a proxy for , in which case we can compute a probability distribution for . The variance of will be the variance of the random component , while the expected value of will be (which we estimate from ).

An example may clarify. I generated values following a uniform distribution from -1 to +1, used the relationship (so the correct parameters are and , and added noise to the values to get values; here’s a scatterplot of against (the solid line represents the correct relationship ):

If we fit a line by least-squares regression we get a nearly perfect estimate of the correct relationship:

We can even confirm that the relationship is at least approximately linear by computing averages over small intervals of :

Clearly and are (at least approximately) linearly related, clearly our regression line is almost exactly the true relationship. There’s no problem using as a proxy for or , everything is hunky-dory and peachy-keen.

## Imperfect Proxies

Now let’s turn things around a bit. Let’s suppose there’s no noise in the response variable, so we get to observe instead of the noisy version . But let’s also suppose that there **is** noise in the predictor variable, so instead of observing we observe

,

where is a random component (possibly genuine signal, perhaps just measurement error) in the predictor variable which follows a normal distribution with mean zero and variance . Here’s some sample data according to this prescription:

As expected, it looks a lot like the previous artificial data, except now we’ve added noise to to give , rather than adding noise to to get .

If we now estimate the relationship by fitting a line with linear regression, we get an estimate which is noticeably different from the true relationship between and :

The slope of our regression line is significantly less than the slope of the true relationship line. Does this mean that we should abandon this regression if we want to use as a proxy for ?

Not necessarily! Let’s again compute averages of the values over small intervals of :

Note that the averages no longer follow a straight line. As we get nearer to the edges of the observed interval, the relationship between “average ” and deviates more and more from the relationship between and .

And that’s as it should be, because no longer represents a direct measurement of . Instead it’s a combination of and the random term . Suppose, for instance, that is a person’s true height, is measurement error, and is the measured or estimated height. Then an observed value of 8 feet tall probably does **not** represent a true height of 8 feet tall. It’s more likely that the true height is on the vary tall side, but the measurement error is also positive, so the most likely value of is a bit less than 8 feet. In that case, the most likely value of will be less than it would be if were truly 8 feet tall.

This emphasizes that when we use one variable as a proxy for another, what we really get is the *conditional probability* for the target, given the proxy. When is 8 feet, the measurement error is more likely to be positive than negative because we know *a priori* that it’s much more likely for people to be shorter than 8 feet than it is for them to be taller than that. Essentially we have a *prior distribution* of values which informs our best estimate of given , which in turn informs our best estimate of or , given .

In such a case, the regression line is a good linear estimate of the conditional expectation of (or ) given . If we then had samples of to use as a proxy in order to estimate or then the regression line is the right choice **if** (and only if) the proxy values are drawn from the same distribution as the calibration data.

## Terra Incognita

In the previous example, we’d use the regression line to interpret as a proxy, in spite of its significant deviation from the relationship involving . This is because our regression line takes into account the prior information, namely, the conditional probability for given .

But suppose that we now want to use data as a proxy, but the new values are **not** drawn from the same distribution as those we used to determine the regression line (to calibrate the proxy). We might have used, say, the volume of a rock as a proxy for its weight. To calibrate that proxy, we have a near-perfect device for measuring weight but our volume measurements are imperfect guesses. And, we calibrated our proxy using only rocks from 2000 to 4000 cubic centimeters (cm^{3}), but we’ll use our proxy relationship to estimate the weight of rocks from 200 to 400 cm^{3}. Here’s the calibration data, with the true relationship shown as a black line and the regression-fit relationship as a red line:

Here’s the calibration data together with the new data (plotted as estimated volumes and true weightes), with both the true-relationship line and the regression line extrapolated to the much lower-volume range:

Clearly the proxy is off — it overestimated weightes because the regression fit underestimated the slope of the true-relationship line. And since our new data are drawn from a different distribution than the calibration data, the regression line no longer represents the conditional expection of weight, given the estimated volume.

What we really want in this case is the get the slope of the **true**-relationship line; only then is it valid to extrapolate so far outside the range of the calibration data. But how would we get that?

If (and only if!) we know *a priori* that the weight data used for calibration are noise-free (or so near to noise-free that we can safely make that assumption), then we can find the true-relationship slope by *inverse regression*. Instead of regressing weight on volume, we regress volume on weight. That gives us this:

Note that the regression line is so near the true-relationship line that they’re indistinguishable. If we use the inverse-regression line to extrapolate to the new range, we get this:

Again everything is hunky-dory.

## Double Trouble

Unfortunately things are rarely so simple. The foregoing assumes that the target variable is noise-free, but that’s rarely the case (’tain’t likely). And then there’s the much more realistic situation in which there might be a strict relationship between and , but we don’t get to observe either one noise-free. Instead we observe and , where both and are random. They might be zero-mean and normally distributed, but both our variables have noise. And of course, we don’t want to restrict our proxy to values drawn from the same distribution as the calibration data, we need to extrapolate it to unknown territory, so we really need to estimate the true-relationship parameters.

That particular problem is known as “errors-in-variables” regression. If we know the ratio of the variances and of the proxy and the target noise, then we can estimate the true-relationship slope by a process called *total least squares*. Ordinary least squares finds the coefficients which minimizes the sum of squared errors in the target:

,

where are the model values. Total least squares minimizes the total sum of squared errors

.

To solve this problem we only need to know the *ratio* of to . But we still need *a priori* information about them. Sometimes such information is avaiable. We might, for instance, have multiple measurements of and for the same physical case, which enables us to estimate those variances. But it’s more usual that such is not the case. One might even say, ’tain’t likely.

However, sometimes we don’t even have a clue about the variance ratio. In that case, we can get a valid estimate in some cases. It sometimes works to use a slope estimate other than the usual (least-squares) value, based on higher moments of the distribution for the observed values. But there are cases even when that doesn’t work — if, for instance, the distribution of the noise-free variable is normal, and the noise variables and are also normally distributed, then there’s just no way for us to identify from the raw data alone, the correct ratio of the noise variances, or the slope of the true-relationship line. Fortunately, the “insoluble case” is pretty rare too.

And what does this have to do with climate science? There’s a new paper by Ammann et al. (Climate of the Past, 6, 273-279, 2010) which proposes a method to improve the proxy calibration by using part of the data to compute a correction factor. Different parts are used for the correction factor calculation, *a la* cross-validation. The net result is that the method indicates that the amount of variance in past climate is a bit greater than previously estimated based on proxy data.

I was just trying to show someone over on WUWT why a 50% (0.5) correlation coefficient was not “pathetic” but actually good enough to break out the champagne.

I think I got something out of this. I use regression frequently, and I assume that the predictor variable is known without error, even though I’m sure it’s not. I’ll read this again to try and understand what I’m supposed to do about it. I liked the example of the measurement of 8 ft tall. We’ve had several such extreme measurements at work recently, and I think we’ve tended to assume normal error around them.

They speak of attenuation in the results, not really surprising. Even floating averaging hides the true rate of change. Add in lags, thresholds, seasonal timing not to mention other factors I doubt I could have got much of a signal at all.

The more we understand the past the better we will understand the fututre. (Just as well I am not a statistician.)

Thank you for explaining it in a way I can almost understand.

Well done, Tamino. you made this analysis comprehensible to an utter novice in statistics.

The Ammann et al. article is clearly written and worth reading. Tamino’s purpose wasn’t immediately clear to me until I went on to read the article, and realized that I’d been led to that door by a well-laid path of stepping stones.

I knew that errors in variables tend to attenuate (bias towards zero) regression slopes, but hadn’t thought about how this might affect proxy temperature reconstructions until staring at Ammann’s Figure 2.

Thanks.

Perhaps a better title would be

“Progression of regression for progression* or regression**? ”

*”correction for bias in

the signal amplitude”

**”increased variance”

Dead on Tamino. Very good explanation of important statistical methods.

Good read related to this topic:

http://www.statsoft.com/textbook/multiple-regression/

I think I got something out of this. I use regression frequently, and I assume that the predictor variable is known without error, even though I’m sure it’s not. I’ll read this again to try and understand what I’m supposed to do about it. I liked the example of the measurement of 8 ft tall. We’ve had several such extreme measurements at work recently, and I think we’ve tended to assume normal error around them.