Show correlations as an ordered list, not as a large matrix
I've a data frame with 100+ columns. cor() returns remarkably quickly, but tells me far too much, especially as most columns are not correlated. I'd like it to just tell me column pairs and their correlation, ideally ordered.
In case that doesn't make sense here is an artificial example:
df = data.frame(a=1:10,b=20:11*20:11,c=runif(10),d=runif(10),e=runif(10)*1:10)
z = cor(df)
z looks like this:
a b c d e
a 1.0000000 -0.9966867 -0.38925240 -0.35142452 0.2594220
b -0.9966867 1.0000000 0.40266637 0.35896626 -0.2859906
c -0.3892524 0.4026664 1.00000000 0.03958307 0.1781210
d -0.3514245 0.3589663 0.03958307 1.00000000 -0.3901608
e 0.2594220 -0.2859906 0.17812098 -0.39016080 1.0000000
What I'm looking for is a function that will instead tell me:
a:b -0.9966867
b:c 0.4026664
d:e -0.39016080
a:c -0.3892524
b:d 0.3589663
a:d -0.3514245
b:e -0.2859906
a:e 0.2594220
c:e 0.17812098
c:d 0.03958307
I have a crude way to get rid of some of the noise:
z[abs(z)<0.5]=0
then scan looking for non-zero values. But it is far inferior to the desired output above.
UPDATE: Based on the answers received, and some trial and error, here is the solution I went with:
z[lower.tri(z,diag=TRUE)]=NA #Prepare to drop duplicates and meaningless information
z=as.data.frame(as.table(z)) #Turn into a 3-column table
z=na.omit(z) #Get rid of the junk we flagged above
z=z[order(-abs(z$Freq)),] #Sort by highest correlation (whether +ve or -ve)
Solution 1:
I always use
zdf <- as.data.frame(as.table(z))
zdf
# Var1 Var2 Freq
# 1 a a 1.00000
# 2 b a -0.99669
# 3 c a -0.14063
# 4 d a -0.28061
# 5 e a 0.80519
Then use subset(zdf, abs(Freq) > 0.5)
to select significant values.
Solution 2:
library(reshape)
z[z == 1] <- NA #drop perfect
z[abs(z) < 0.5] <- NA # drop less than abs(0.5)
z <- na.omit(melt(z)) # melt!
z[order(-abs(z$value)),] # sort
Solution 3:
Building off of @Marek's answer. Eliminates diagonal and duplicates
data = as.data.frame( as.table( z ) )
combinations = combn( colnames( z ) , 2 , FUN = function( x ) { paste( x , collapse = "_" ) } )
data = data[ data$Var1 != data$Var2 , ]
data = data[ paste( data$Var1 , data$Var2 , sep = "_" ) %in% combinations , ]