Spearman correlation and ties

I'm computing Spearman's rho on small sets of paired rankings. Spearman is well known for not handling ties properly. For example, taking two sets of 8 rankings, even if 6 of the values in one set are tied, the correlation is still very high:

> cor.test(c(1,2,3,4,5,6,7,8), c(0,0,0,0,0,0,7,8), method="spearman")

    Spearman's rank correlation rho

S = 19.8439, p-value = 0.0274

sample estimates:
      rho 
0.7637626 

Warning message:
 Cannot compute exact p-values with ties

The p-value <.05 seems like a pretty high statistical significance for this data. Is there a ties-corrected version of Spearman in R? What is the best formula to date to compute it with a lot of ties?


Well, Kendall's tau rank correlation is also a non-parametric test for statistical dependence between two ordinal (or rank-transformed) variables--like Spearman's rho, but unlike Spearman's, it can handle ties.

More specifically, there are three Kendall tau statistics--tau-a, tau-b, and tau-c. tau-b is specifically adapted to handle ties.

The tau-b statistic handles ties (i.e., pairs in which the two data points have the same value on x or on y) via a divisor term, which is the geometric mean of the number of pairs not tied on x and the number of pairs not tied on y.

Kendall's tau is not Spearman's--they are not the same, but they are quite similar. You'll have to decide, based on context, whether the two are similar enough that one can be substituted for the other.

For instance, tau-b:

Kendall_tau_b = (P - Q) / ( (P + Q + Y0)*(P + Q + X0) )^0.5

P: number of concordant pairs ('concordant' means the two data points in the pair are ordered the same way on both x and y)

Q: number of discordant pairs

X0: number of pairs tied only on x (tied on x but not on y)

Y0: number of pairs tied only on y (tied on y but not on x)
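
To make the counts concrete, here is a small sketch (added here, not part of the original answer) that tallies P, Q, X0, and Y0 by brute force on the data from the question and plugs them into the formula. As far as I know, R's cor() with method="kendall" returns tau-b when ties are present, so the two values should agree:

x <- c(1,2,3,4,5,6,7,8)
y <- c(0,0,0,0,0,0,7,8)

P <- Q <- X0 <- Y0 <- 0
for (i in 1:(length(x) - 1)) {
  for (j in (i + 1):length(x)) {
    dx <- sign(x[j] - x[i])
    dy <- sign(y[j] - y[i])
    if (dx != 0 && dy != 0) {
      if (dx == dy) P <- P + 1 else Q <- Q + 1   ## concordant / discordant
    } else if (dx == 0 && dy != 0) {
      X0 <- X0 + 1                               ## tied only on x
    } else if (dx != 0 && dy == 0) {
      Y0 <- Y0 + 1                               ## tied only on y
    }                                            ## pairs tied on both are not counted
  }
}

(P - Q) / sqrt((P + Q + Y0) * (P + Q + X0))   ## 0.6813851
cor(x, y, method="kendall")                   ## built-in Kendall's tau, should match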

There is in fact a variant of Spearman's rho that explicitly accounts for ties. In situations in which I needed a non-parametric rank correlation statistic, I have always chosen tau over rho. The reason is that rho sums the squared rank differences, whereas tau sums the absolute discrepancies. Given that both tau and rho are competent statistics and we are left to choose, a linear penalty on discrepancies (tau) has always seemed to me a more natural way to express rank correlation. That's not a recommendation; your context might be quite different and dictate otherwise.


I think exact=FALSE does the trick.

cor.test(c(1,2,3,4,5,6,7,8), c(0,0,0,0,0,0,7,8), method="spearman", exact=FALSE)

    Spearman's rank correlation rho

data:  c(1, 2, 3, 4, 5, 6, 7, 8) and c(0, 0, 0, 0, 0, 0, 7, 8)
S = 19.8439, p-value = 0.0274
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.7637626 

cor.test with method="spearman" actually calculates the Spearman coefficient corrected for ties. I've checked it by "manually" calculating tie-corrected and tie-uncorrected Spearman coefficients from equations in Zar 1984, Biostatistical Analysis. Here's the code - just substitute your own variable names to check for yourself:

ym <- data.frame(lousy, dors)  ## my data (two variables assumed to be in the workspace)

## ranking both variables
ym$l <- rank(ym$lousy)
ym$d <- rank(ym$dors)

## squared differences between ranks
ym$d2d <- (ym$l - ym$d)^2

## tie terms for equations 19.35 and 19.37 in Zar 1984
lice <- as.data.frame(table(ym$lousy))
lice$t <- lice$Freq^3 - lice$Freq

dorsal <- as.data.frame(table(ym$dors))
dorsal$t <- dorsal$Freq^3 - dorsal$Freq

n <- nrow(ym)
sum.d2 <- sum(ym$d2d)
Tx <- sum(lice$t)/12
Ty <- sum(dorsal$t)/12

## calculating the coefficients
rs1 <- 1 - (6*sum.d2/(n^3-n))  ## "standard" Spearman cor. coeff. (uncorrected for ties) - eq. 19.35
rs2 <- ((n^3-n)/6 - sum.d2 - Tx - Ty)/sqrt(((n^3-n)/6 - 2*Tx)*((n^3-n)/6 - 2*Ty))  ## Spearman cor. coeff. corrected for ties - eq. 19.37

## comparing with the cor.test function
cor.test(ym$lousy, ym$dors, method="spearman")  ## cor.test gives the tie-corrected coefficient!
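
As a quick check (an added sketch, not from the original answer), the same equations can be wrapped in a small helper; spearman_zar() is just a hypothetical name. Run on the data from the question, the tie-corrected value should reproduce the rho reported by cor.test above, while the uncorrected one differs:

spearman_zar <- function(x, y) {
  rx <- rank(x); ry <- rank(y)
  n  <- length(x)
  sum.d2 <- sum((rx - ry)^2)
  tie.term <- function(v) { f <- table(v); sum(f^3 - f) / 12 }  ## Tx, Ty of eq. 19.37
  Tx <- tie.term(x); Ty <- tie.term(y)
  rs1 <- 1 - 6 * sum.d2 / (n^3 - n)                             ## eq. 19.35 (no tie correction)
  rs2 <- ((n^3 - n)/6 - sum.d2 - Tx - Ty) /
         sqrt(((n^3 - n)/6 - 2*Tx) * ((n^3 - n)/6 - 2*Ty))      ## eq. 19.37 (tie-corrected)
  c(uncorrected = rs1, corrected = rs2)
}

spearman_zar(c(1,2,3,4,5,6,7,8), c(0,0,0,0,0,0,7,8))
## uncorrected   corrected
##   0.7916667   0.7637626   <- the corrected value matches cor.test's rho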

  • Ties-corrected Spearman

    Using method="spearman" gives you the ties-corrected Spearman. Spearman's rho is, by definition, simply Pearson's sample correlation coefficient computed on the ranks of the sample data, so it works both in the presence and in the absence of ties. You can see that after replacing your original data with their ranks (midranks for ties) and using method="pearson", you will get the same result:

    > cor.test(rank(c(1,2,3,4,5,6,7,8)), rank(c(0,0,0,0,0,0,7,8)), method="pearson")
    
    Pearson's product-moment correlation
    
    data:  rank(c(1, 2, 3, 4, 5, 6, 7, 8)) and rank(c(0, 0, 0, 0, 0, 0, 7, 8))
    t = 2.8983, df = 6, p-value = 0.0274
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.1279559 0.9546436
    sample estimates:
      cor 
    0.7637626 
    

    Notice that there exists a simplified no-ties formula for Spearman's rho, which the cor.test() implementation in fact uses in the absence of ties, but it is equivalent to the definition above (see the quick check after this list).

  • P-value

    When there are ties in the data, exact p-values are computed neither for the Spearman nor for the Kendall measure (within the cor.test() implementation), hence the warning. As mentioned in Eduardo's post, to avoid the warning you should set exact=FALSE.
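
A quick check of the no-ties remark above (an added sketch with made-up, tie-free rankings for y): the simplified formula and the Pearson-on-ranks definition give the same value when no ties are present.

x <- c(1,2,3,4,5,6,7,8)
y <- c(2,1,4,3,6,5,8,7)                        ## made-up, tie-free rankings
d <- rank(x) - rank(y)
1 - 6 * sum(d^2) / (length(x)^3 - length(x))   ## simplified no-ties formula: 0.9047619
cor(rank(x), rank(y))                          ## Pearson on the ranks: same value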


The paper "A new rank correlation coefficient with application to the consensus ranking problem" is aimed to solve the ranking with tie problem. It also mentions that Tau-b should not be used as a ranking correlation measure for measuring agreement between weak orderings.

Emond, E. J. and Mason, D. W. (2002), A new rank correlation coefficient with application to the consensus ranking problem. Journal of Multi-Criteria Decision Analysis, 11: 17-28. doi:10.1002/mcda.313