Rfast segmentation fault on independence test

I am having trouble using the G2 test function (g2Test) of the Rfast package in R: it crashes with a segmentation fault even though the input parameters seem correct to me.

More specifically, I am able to run the example code from the manual page:

nvalues <- 3
nvars <- 10
nsamples <- 5000
data <- matrix( sample( 0:(nvalues - 1), nvars * nsamples, replace = TRUE ), nsamples, nvars )
dc <- rep(nvalues, nvars)

res<-g2Test( data, 1, 2, 3, c(3, 3, 3) )

But I'm not able to make it run on my data. The function g2Test takes as input a matrix of numbers, the indices of the two columns to test, the index (or indices) of the column(s) to condition on (in the example we are studying the dependence of the first column on the second, conditioned on the third), and a vector with the number of unique values per column.

My code follows the same principles, reading the data from the ALARM CSV file:

library(readr)
library(Rfast)

# open the file
path <-  "datasets/alarm.csv"
dataset <- read.csv(path)
# search for the indexes of the column I'm interested in and the amount of unique values per column
c1 <- "PVS"
c2 <- "ACO2"
s <- c("VALV", "VLNG", "VTUB",   "VMCH")
n <- colnames(dataset) 
col_c1 <- match(c1, n)
col_c2 <- match(c2, n)
cols_c3 <- c()
uni <- c(length(unique(dataset[c1])[[1]])[[1]],length(unique(dataset[c2])[[1]])[[1]])
if (!s[1]=="()"){
 for(v in s){
   idx <- match(v, n)
   cols_c3 <- append(cols_c3,idx)
   uni <- append(uni,length(unique(dataset[v])[[1]])[[1]])
 }
}
# transforming the string DataFrame into an integer matrix
for (nn in n){
  dataset[nn] <- unclass(as.factor(dataset[nn][[1]]))
}
ds <- as.matrix(dataset)
colnames(ds) <- NULL

# running the G2 test
res <- g2Test(ds, col_c1, col_c2, cols_c3, uni)

But it results in a segmentation fault:

 *** caught segfault ***
address 0x1f103f96a, cause 'memory not mapped'

Traceback:
 1: g2Test(ds, col_c1, col_c2, cols_c3, uni)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

The same happens if I condition on just one variable and not on multiple ones.

I really don't understand why this happens, since it seems to me that my case is the same as the example in the reference, just with different data. I would really appreciate any help debugging this issue. Please tell me if I need to provide further information.


Solution 1:

First, I'm sorry that I missed that you had originally included your data!

Alright, I wish I had realized this sooner (as you will, too...). The columns you pass to g2Test have to be consecutive and the values must start at zero. So what does that mean? You have to rearrange the columns so that col_c1 is the first column, col_c2 is the second column, and so on. You also have to subtract one from all values (since the lowest value after the factor conversion is 1).
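To make the zero-based requirement concrete, here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from the ALARM file). It only illustrates that the codes produced by `as.factor()` start at 1, so one has to be subtracted before the matrix goes to g2Test:

```r
# Hypothetical toy data standing in for string-valued columns
toy <- data.frame(a = c("LOW", "HIGH", "NORMAL", "LOW"),
                  b = c("TRUE", "FALSE", "TRUE", "TRUE"),
                  stringsAsFactors = FALSE)

# Factor codes are 1-based: the levels are sorted and numbered from 1
codes <- sapply(toy, function(col) as.integer(as.factor(col)))
range(codes)       # 1 3

# Shift all codes down by one so the smallest value is 0
zero_based <- codes - 1L
range(zero_based)  # 0 2
```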

This is what I did (and how I checked it):

# there was no PVS, I assume this was PVSAT
c1 <- "PVSAT"
# c1 <- "PVS"

# there was no ACO2, I assume this was ARTCO2
c2 <- "ARTCO2"
# c2 <- "ACO2"

# there are no columns with these names...
# for VALV - VENTALV; for VLNG - VENTLUNG; for VTUB - VENTTUBE; for VMCH - VENTMACH
s <- c("VENTALV", "VENTLUNG", "VENTTUBE", "VENTMACH")
# s <- c("VALV", "VLNG", "VTUB", "VMCH")

This next chunk is exactly as you wrote it:

n <- colnames(dataset) 

col_c1 <- match(c1, n)
col_c2 <- match(c2, n)

cols_c3 <- c()

uni <- c(length(unique(dataset[c1])[[1]])[[1]],length(unique(dataset[c2])[[1]])[[1]])

if (!s[1]=="()"){
  for(v in s){
    idx <- match(v, n)
    cols_c3 <- append(cols_c3,idx)
    uni <- append(uni,length(unique(dataset[v])[[1]])[[1]])
  }
}
# transforming the string DataFrame into an integer matrix
for (nn in n){
  dataset[nn] <- unclass(as.factor(dataset[nn][[1]]))
}

ds <- as.matrix(dataset)

This is where I made the minimum zero:

# look at the number of unique values before changing, as a means of validation
sapply(1:ncol(ds), function(x) length(unique(ds[, x])))
# look at the minimum, as a means of validation
sapply(1:ncol(ds), function(x) min(ds[,x]))
# the minimum value must be zero
ds <- ds - 1
# check
sapply(1:ncol(ds), function(x) min(ds[,x]))
sapply(1:ncol(ds), function(x) length(unique(ds[, x])))

# looked as expected
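
The manual checks above could also be wrapped in a small function. This is only a sketch: `is_zero_based` is a hypothetical helper name, and it assumes that each column should contain exactly the codes 0, 1, ..., k - 1 with no gaps, which is what the factor conversion followed by the subtraction produces.

```r
# Hypothetical helper: for each column of m, TRUE if its distinct values
# are exactly 0, 1, ..., k - 1 (zero-based, no gaps)
is_zero_based <- function(m) {
  apply(m, 2, function(col) {
    vals <- sort(unique(col))
    identical(as.numeric(vals), as.numeric(seq_along(vals) - 1))
  })
}

# Made-up matrix: only the first column is coded correctly
m <- cbind(c(0, 1, 2, 1), c(1, 2, 1, 2), c(0, 2, 2, 0))
is_zero_based(m)  # TRUE FALSE FALSE
```

Running `all(is_zero_based(ds))` before calling g2Test would flag a column that still starts at 1.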

Next, I rearranged the columns. I did this before removing the names so I could use the names to ensure the order was correct.

# the data must be consecutive numbers
# catch names before and after
n2 <- dimnames(ds)
# some of the results from this:
# [[2]]
#  [1] "HISTORY"      "CVP"          "PCWP"         "HYPOVOLEMIA"

# create the list of column indices other than those getting called in g2Test
tellMe <- c(1:ncol(ds))
tellMe <- tellMe[-c(col_c1, col_c2, sort(cols_c3))] 

# rearrange using the indices
ds <- ds[, c(col_c1, col_c2, sort(cols_c3), tellMe)]

# check it
(n3 <- dimnames(ds))
# some of the results from this
# [[2]]
#  [1] "PVSAT"        "ARTCO2"       "VENTMACH"     "VENTTUBE"

All that's left is removing the names (just as you did) and then calling the function. Since the indices changed, your objects won't work here, though.

colnames(ds) <- NULL

# running the G2 test
# res <- g2Test(ds, col_c1, col_c2, sort(cols_c3), uni)
res2 <- g2Test(ds, 1, 2, c(3,4,5,6), c(3, 3, 4, 4, 4, 4))
# $statistic
# [1] 19.78506
# 
# $df
# [1] 1024
#
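
Instead of hard-coding the new indices and level counts, the rearranging step can be sketched as a helper that derives them. Everything here is an assumption for illustration: `arrange_for_g2` is a hypothetical name, the toy matrix is made up, and the g2Test call is left commented out because it was not run as part of this sketch. Unlike the code above, the conditioning columns keep the order in which they are given rather than being sorted.

```r
# Hypothetical helper: move the tested columns to the front, then the
# conditioning columns, then all remaining columns, and build the
# matching arguments for g2Test
arrange_for_g2 <- function(m, x, y, cs) {
  first <- c(x, y, cs)
  rest <- setdiff(seq_len(ncol(m)), first)
  m <- m[, c(first, rest), drop = FALSE]
  list(
    data = unname(m),          # drop the dimnames, as in the question
    x = 1L,
    y = 2L,
    cs = 2L + seq_along(cs),   # conditioning columns now follow x and y
    # number of distinct values in each of the leading columns
    dc = unname(apply(m[, seq_along(first), drop = FALSE], 2,
                      function(col) length(unique(col))))
  )
}

# Toy zero-based matrix (made-up values)
set.seed(1)
toy <- cbind(A = sample(0:1, 200, TRUE), B = sample(0:2, 200, TRUE),
             C = sample(0:3, 200, TRUE), D = sample(0:2, 200, TRUE))
g2_args <- arrange_for_g2(toy, x = 4, y = 2, cs = c(1, 3))
g2_args$cs  # 3 4
g2_args$dc  # 3 3 2 4
# res <- Rfast::g2Test(g2_args$data, g2_args$x, g2_args$y, g2_args$cs, g2_args$dc)
```

Deriving `dc` from the rearranged matrix keeps the level counts in the same order as the columns, which is the part that is easy to get wrong by hand.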