
Geometric argument for constraints on corr(X,Z) given corr(X,Y) and corr(Y,Z)

(Note: This is one of three posts that I wrote some time ago that have just been languishing under the “Misc.” tab of my website for a while, because for whatever reason I didn’t feel that they were a good fit for my blog. Well, I’ve decided to go ahead and move them to the blog, so here you go!)

Link to original TalkStats thread, September 17, 2013.

Today a labmate asked the following question: if we have three random variables x, y, z, and we know the correlations r_{xy} and r_{yz}, what constraints, if any, does this place on the correlation r_{xz}?

At the time I reflexively answered that the remaining correlation must be the product of the two known correlations. Which of course is totally wrong. I think I was getting some mental interference from some of the equations for simple mediation floating around in my head. Anyway, after thinking about it for a while I have come up with a convincing geometric argument for what the constraints actually are. I have also verified that my answer agrees with a more complicated-looking answer to this question that I found elsewhere online.

Because I ended up spending a lot of time on this and thought some of you would find the results interesting, I decided to share my work here. Comments are welcome!

Okay. So imagine our variables x, y, z as vectors in n-dimensional space. The Pearson correlation coefficient between any two of these variables can be interpreted as the cosine of the angle between the corresponding vectors. This is an interesting and well-known geometric fact about correlation coefficients.
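
To make this concrete, here is a quick R check (not part of the original post) that the sample correlation equals the cosine of the angle between the mean-centered vectors:

set.seed(1)
x <- rnorm(50)
y <- rnorm(50)
xc <- x - mean(x)  # center the vectors so the angle is measured between deviations from the mean
yc <- y - mean(y)
cos_angle <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
all.equal(cos_angle, cor(x, y))  # TRUE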

So now imagine that the x and y vectors are fixed (and hence so is their correlation), but that the vector z is free to vary so long as r_{yz} is constant. This constraint on r_{yz} means that the set of possible z vectors will form a sort of “cone” around the y vector, as in the following image:

Now it is intuitively obvious (I know this is a sneaky phrase, but that’s why I call this just an “argument” and not a “proof”) that the two possible z vectors that will lead to the minimum/maximum values of r_{xz} are the z vectors that lie on the same plane as the x and y vectors. This leads to the following expression for the minimum/maximum values of r_{xz} given r_{xy} and r_{yz}:

cos[arccos(r_{xy}) \pm arccos(r_{yz})].

One notable result following from this is that if x is orthogonal to y and y is orthogonal to z, then there is no constraint on r_{xz}: it can be anywhere from -1 to +1. But under any other circumstances, fixing r_{xy} and r_{yz} will place some constraint on the range of r_{xz}.
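
As a small illustrative sketch (the function name is mine, not from the thread), the geometric bound is easy to compute directly in R:

# Bounds on r_xz implied by the geometric argument: cos[arccos(r_xy) +/- arccos(r_yz)]
rxz_bounds <- function(rxy, ryz) {
  sort(cos(acos(rxy) + c(-1, 1) * acos(ryz)))
}

rxz_bounds(0, 0)      # orthogonal case: -1 to 1, i.e. no constraint at all
rxz_bounds(0.8, 0.5)  # roughly -0.12 to 0.92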

Okay, now for the verification part, which requires a bit of math.

So in this stats.stackexchange.com thread it is stated that the three correlations must satisfy

1+2r_{xy}r_{xz}r_{yz}-(r_{xy}^2+r_{xz}^2+r_{yz}^2) \ge 0,

the reasoning being that this expression is the determinant of the correlation matrix, which cannot be negative. Anyway, this can be viewed as a quadratic inequality in r_{xz}, already in standard form:

(-1)r_{xz}^2 + (2r_{xy}r_{yz})r_{xz} + (1 - r_{xy}^2 - r_{yz}^2) \ge 0.

So if we apply the quadratic formula and simplify the result, we get the following for the minimum/maximum values of r_{xz}:

r_{xy}r_{yz} \pm \sqrt{(1-r_{xy}^2)(1-r_{yz}^2)}.

Now taking my answer and applying the trig identity cos(a \pm b) = cos(a)cos(b) \mp sin(a)sin(b) we get

r_{xy}r_{yz} \pm sin(arccos(r_{xy}))sin(arccos(r_{yz})).

Now applying the identity sin(x) = \sqrt{1 - [cos(x)]^2} we get

r_{xy}r_{yz} \pm \sqrt{(1-r_{xy}^2)(1-r_{yz}^2)},

which is the answer we got from the Stack Exchange post. So our simpler, geometrically based answer agrees with the more conventional answer, which is harder to interpret.
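
As a quick numerical sanity check (example values chosen arbitrarily, not from either thread), the two expressions do agree:

rxy <- 0.6
ryz <- -0.3
geometric <- cos(acos(rxy) + c(-1, 1) * acos(ryz))                  # cos[arccos +/- arccos]
algebraic <- rxy * ryz + c(1, -1) * sqrt((1 - rxy^2) * (1 - ryz^2)) # product +/- sqrt form
all.equal(geometric, algebraic)  # TRUE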

Statistical reviewing trick: Testing arbitrary contrasts based on summary statistics

(Note: This is one of three posts that I wrote some time ago that have just been languishing under the “Misc.” tab of my website for a while, because for whatever reason I didn’t feel that they were a good fit for my blog. Well, I’ve decided to go ahead and move them to the blog, so here you go!)

Link to original TalkStats thread, June 24, 2012.

Following the discussion in this recent thread and spunky’s request therein, here is a neat trick one can use to test arbitrary contrasts for ANOVA models, and even multiple degree of freedom tests, using only some basic summary statistics of the kind that would be reported in a manuscript — without needing access to the raw data.

Setup

We need the following three pieces of information to do these tricks:

  • the cell means
  • the sample sizes per cell
  • at least one F-ratio corresponding to any of the possible contrasts from the ANOVA model

So let’s say we have a manuscript on our desk in which the authors conducted a 2×2 factorial ANOVA, with factors A (having levels A_{+1} and A_{-1}) and B (having levels B_{+1} and B_{-1}). They give a table of means and sample sizes by cell, but for whatever reason only report the test of the interaction. So we have the following information:
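
The raw data behind these numbers aren't available, of course; below is a minimal, hypothetical sketch of how a dataset like dat (and the contrast matrix contr used later) might be constructed, purely so the code can be run. The column names and effect sizes are my assumptions, and the simulated values won't reproduce the exact output shown.

# Hypothetical reconstruction of a dataset like `dat` (assumed structure:
# y plus contrast-coded columns A, B, and AB taking values +1/-1)
set.seed(123)
n   <- c(18, 23, 23, 16)        # cell sizes as reported below
A   <- rep(c(-1, -1, 1, 1), n)
B   <- rep(c(-1, 1, -1, 1), n)
dat <- data.frame(y = 100 + 2 * A * B + rnorm(sum(n), sd = 5),
                  A = A, B = B, AB = A * B)

# contrast matrix referenced later in the post
contr <- cbind(A = c(1, 1, -1, -1), B = c(1, -1, 1, -1), AB = c(1, -1, -1, 1))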


# cell means
round(tapply(dat$y, list(A = dat$A,B = dat$B), mean), 2)
#     B
# A        -1      1
#   -1 103.54  99.28
#   1   99.59 102.61

# cell sample sizes
table(A = dat$A,B =  dat$B)
#     B
# A    -1  1
#   -1 18 23
#   1  23 16

# t statistic for interaction contrast
summary(lm(y ~ A + B + AB, data=dat))$coef["AB",]
#    Estimate  Std. Error     t value    Pr(>|t|) 
# 1.819051097 0.570385136 3.189162871 0.002073302

Case 1: single degree of freedom tests

Perhaps as curious readers we are interested in knowing whether the A_{+1}, B_{+1} cell differs from the other three cells. In other words we want to test the contrast below labeled “new”:


cbind(contr, new = c(3,-1,-1,-1))
#       A  B AB new
# [1,]  1  1  1   3
# [2,]  1 -1 -1  -1
# [3,] -1  1 -1  -1
# [4,] -1 -1  1  -1

Following the formula here, the F-ratio can be computed as

F = \frac{MSR}{MSE} = \frac{SSR/(p_{large} - p_{small})}{SSE/(N - p_{large})}

where p_{large} and p_{small} are the numbers of parameters in the full model and the nested model, respectively; and N is the total sample size.
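
In R this is a one-liner; a hypothetical helper (the name and arguments are mine) might look like:

# F-ratio from SSR, SSE, the two model sizes, and the total sample size
f_ratio <- function(ssr, sse, p_large, p_small, N) {
  (ssr / (p_large - p_small)) / (sse / (N - p_large))
}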

So the only two missing quantities here are SSR and SSE. If we can get those we can compute the desired F-ratio.

Given a particular contrast \lambda,

SSR for \lambda = \frac{(\sum_{j=1}^J \lambda_j \bar{Y_j})^2}{\sum_{j=1}^J (\lambda_j^2/n_j)}

where \lambda_j is the contrast weight for group j, \bar{Y_j} is the mean for group j, J is the number of groups, and n_j is the number of observations in group j.
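
This formula also translates directly into R. A hypothetical helper (again, names are mine), applied to the "new" contrast with the reported means and cell sizes:

# SSR for a contrast: lambda = contrast weights, means = cell means, n = cell sizes,
# all given in the same cell order
ssr_contrast <- function(lambda, means, n) {
  sum(lambda * means)^2 / sum(lambda^2 / n)
}

# cells ordered A+1B+1, A+1B-1, A-1B+1, A-1B-1
ssr_contrast(lambda = c(3, -1, -1, -1),
             means  = c(102.61, 99.59, 99.28, 103.54),
             n      = c(16, 23, 23, 18))  # ~41.67, matching the hand computation below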

So in this data we have

SSR_{new} = \frac{[3(102.61) - 99.59 - 99.28 - 103.54]^2}{9/16 + 1/23 + 1/23 + 1/18} = 41.67.

Now we need to get SSE. To do this, we can use the same formula to compute SSR for a contrast for which we already know F, and then rearrange the F-ratio formula to solve for SSE.

So for the known interaction contrast we have

SSR_{interaction} = \frac{(102.61 - 99.59 - 99.28 + 103.54)^2}{1/16 + 1/23 + 1/23 + 1/18} = 258.51.

Solving the F formula for SSE gives

SSE = \frac{SSR(N - p_{large})}{F(p_{large} - p_{small})}.

Since t^2 = F for a single degree of freedom test, the reported interaction t of 3.189 gives F = 10.17, and we can now just plug in the numbers to get

SSE = \frac{258.51(80 - 4)}{10.17(4 - 3)} = 1931.84,

so that finally we have

F_{new} = \frac{41.67/(4 - 3)}{1931.84/(80 - 4)} = 1.64.
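
Putting Case 1 together in R, using only the reported summary statistics (the SSR values are the ones computed above):

ssr_new <- 41.67              # SSR for the "new" contrast
ssr_int <- 258.51             # SSR for the reported interaction contrast
f_int   <- 3.189^2            # F = t^2 for the reported interaction test, ~10.17
sse     <- ssr_int * (80 - 4) / f_int              # ~1931.8
f_new   <- (ssr_new / (4 - 3)) / (sse / (80 - 4))  # ~1.64
pf(f_new, df1 = 1, df2 = 76, lower.tail = FALSE)   # ~0.20, cf. Pr(>F) for "new" in the check below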

And we can check our work by running anova() on the dataset after recoding the contrasts:


anova(lm(y ~ other1 + other2 + new, data=dat))
# Analysis of Variance Table
# 
# Response: y
#           Df  Sum Sq Mean Sq F value   Pr(>F)   
# other1     1   38.06  38.056  1.4988 0.224639   
# other2     1  191.60 191.604  7.5462 0.007505 **
# new        1   41.50  41.496  1.6343 0.205002   
# Residuals 76 1929.70  25.391                    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Aside from some minimal rounding error, we have it.

Case 2: multiple degree of freedom tests

Given the same data as above, suppose now that for some reason we wish to see the 2 degree of freedom test comparing the full ANOVA model (factors A, B, and their interaction) to a model that only includes factor A.

We already saw above how to solve for SSE (and in fact we already computed it; we are using the same full model, so SSE will be the same). However, here we will get SSR in a slightly different way, using

SSR = \sum_{i=1}^N(\hat{Y}_{iSmall} - \hat{Y}_{iLarge})^2,

where \hat{Y}_{iSmall} is the predicted value for observation i under the smaller or reduced model, \hat{Y}_{iLarge} is the predicted value for observation i under the larger or full model, and N is the number of observations. Essentially, we are treating the predicted values from the more complex model as the data to be predicted and then computing the sum of squared errors in the normal fashion.

In the ANOVA case, this formula can be written more simply as

SSR = \sum_{j=1}^Jn_j(\hat{Y}_{jSmall} - \hat{Y}_{jLarge})^2,

where n_j is the number of observations in group j, \hat{Y}_{jSmall} is the predicted value for group j under the smaller or reduced model, \hat{Y}_{jLarge} is the predicted value for group j under the larger or full model, and J is the number of groups.

The predicted values from the Large model are straightforward: they are the group means. For the Small model there are two predicted values, one for A_{+1} and one for A_{-1}; in each case the predicted value is the average of the two cell means at that level of A (i.e., collapsing across the B factor), weighted by cell size.

For A_{+1}:

\hat{Y}_{A_{+1}Small} = \frac{[23(99.59) + 16(102.61)]}{23 + 16} = 100.83

For A_{-1}:

\hat{Y}_{A_{-1}Small} = \frac{[18(103.54) + 23(99.28)]}{18 + 23} = 101.15

So using the simplified SSR formula, we have

SSR = 18(101.15 - 103.54)^2 + 23(101.15 - 99.28)^2
+ 23(100.83 - 99.59)^2 + 16(100.83 - 102.61)^2 = 269.31,

which makes our F-ratio

F = \frac{269.31/(4 - 2)}{1931.84/(80 - 4)} = 5.30.
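
The whole of Case 2 can likewise be done in a few lines of R from the reported summary statistics:

n     <- c(18, 23, 23, 16)                 # cells: A-1B-1, A-1B+1, A+1B-1, A+1B+1
means <- c(103.54, 99.28, 99.59, 102.61)   # cell means in the same order
pred_Aminus <- (18 * 103.54 + 23 * 99.28) / (18 + 23)  # ~101.15, Small-model prediction for A-1
pred_Aplus  <- (23 * 99.59 + 16 * 102.61) / (23 + 16)  # ~100.83, Small-model prediction for A+1
small <- c(pred_Aminus, pred_Aminus, pred_Aplus, pred_Aplus)
ssr   <- sum(n * (small - means)^2)          # ~269.3
sse   <- 1931.84                             # carried over from Case 1
f     <- (ssr / (4 - 2)) / (sse / (80 - 4))  # ~5.30
pf(f, df1 = 2, df2 = 76, lower.tail = FALSE) # ~0.007, cf. the anova() check below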

Checking our work:


anova(lm(y ~ A, data=dat),
      lm(y ~ A + B + AB, data=dat))
# Analysis of Variance Table
# 
# Model 1: y ~ A
# Model 2: y ~ A + B + AB
#   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
# 1     78 2198.8                                
# 2     76 1929.7  2    269.06 5.2983 0.007013 **
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

And again we have it, save for minimal rounding error.