What is the algorithm behind R core's `split` function?

r split

split is an especially important function in R core. Many Stack Overflow answers offering R-base solutions on data manipulation rely on it. It is the workhorse routine of any group-by operations.

There are also many questions whose solution is just a single line with split. Many people do not know that

split.data.frame can split a matrix by row;
split.default can split a data frame by column.

Perhaps R documentation on split is not doing very well. It does mention the first use, but does not mention the second.

There are four methods for split in R core:

methods(split)
#[1] split.data.frame split.Date       split.default    split.POSIXct

I will provide an answer explaining in depth how split.data.frame, split.default and the C-level .Internal(split(x, f)) work. Other answers are welcomed on "Date" and "POSIXct" object.

How does `split.data.frame` work?

function (x, f, drop = FALSE, ...) 
lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...), 
       function(ind) x[ind, , drop = FALSE])

It calls split.default to split row index vector seq_len(nrow(x)), then use an lapply loop to extract associated rows into a list entry.

This isn't strictly a "data.frame" method. It splits any 2-dimensional objects by the 1st dimension, including splitting a matrix by rows.

How does `split.default` work?

function (x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...) 
{
if (!missing(...)) 
    .NotYetUsed(deparse(...), error = FALSE)
if (is.list(f)) 
    f <- interaction(f, drop = drop, sep = sep, lex.order = lex.order)
else if (!is.factor(f)) 
    f <- as.factor(f)
else if (drop) 
    f <- factor(f)
storage.mode(f) <- "integer"
if (is.null(attr(x, "class"))) 
    return(.Internal(split(x, f)))
lf <- levels(f)
y <- vector("list", length(lf))
names(y) <- lf
ind <- .Internal(split(seq_along(x), f))
for (k in lf) y[[k]] <- x[ind[[k]]]
y
}

if x has no classes (i.e., mostly an atomic vector), .Internal(split(x, f)) is used;
otherwise, it uses .Internal(split()) to split the index along x, then uses a for loop to extract associated elements into a list entry.

An atomic vector (see ?vector) is a vector with the following mode:

"logical", "integer", "numeric", "complex", "character" and "raw"
"list"
"expression"

An object with class... Er... there are so many!! Let me just give three examples:

"factor"
"data.frame"
"matrix"

In my opinion the split.default is not well written. There are so many objects with classes, yet split.default would deal with them in the same way via"[". This works fine with "factor" and "data.frame" (so we will be splitting data frame along the columns!), but it definitely does not work with a matrix in a way we expect.

A <- matrix(1:9, 3)
#     [,1] [,2] [,3]
#[1,]    1    4    7
#[2,]    2    5    8
#[3,]    3    6    9

split.default(A, c(1, 1, 2))  ## it does not split the matrix by columns!
#$`1`
#[1] 1 2 4 5 7 8
#
#$`2`
#[1] 3 6 9

Actually recycling rule has been applied to c(1, 1, 2), and we are equivalently doing:

split(c(A), rep_len(c(1,1,2), length(A)))

Why doesn't R core write another line for a "matrix", like

for (k in lf) y[[k]] <- x[, ind[[k]], drop = FALSE]

Till now the only way to safely split a matrix by columns is to transpose it, then split.data.frame, then another transpose.

lapply(split.data.frame(t(A), c(1, 1, 2)), t)

Another workaround via lapply(split.default(data.frame(A), c(1, 1, 2)), as.matrix) is buggy if A is a character matrix.

How does `.Internal(split(x, f))` work?

This is really the core of the core! I will take a small example below for explanation:

set.seed(0)
f <- sample(factor(letters[1:3]), 10, TRUE)
# [1] c a b b c a c c b b
#Levels: a b c

x <- 0:9

Basically there are 3 steps. To enhance readability, Equivalent R code are provided for each step.

step 1: tabulation (counting occurrence of each factor level)

## a factor has integer mode so `tabulate` works
tab <- tabulate(f, nbins = nlevels(f))
[1] 2 4 4

step 2: storage allocation of the resulting list

result <- vector("list", nlevels(f))
for (i in 1:length(tab)) result[[i]] <- vector(mode(x), tab[i])
names(result) <- levels(f)

I would annotate this list as follows, where each line is a list element which is a vector in this example, and each [ ] is a placeholder for an entry of that vector.

$a: [ ] [ ]

$b: [ ] [ ] [ ] [ ]

$c: [ ] [ ] [ ] [ ]

step 3: element allocation

Now it is useful to uncover the internal integer mode for a factor:

.f <- as.integer(f)
#[1] 3 1 2 2 3 1 3 3 2 2

We need to scan x and .f, filling x[i] into the right entry of result[[.f[i]]], informed by an accumulator buffer vector.

ab <- integer(nlevels(f))  ## accumulator buffer

for (i in 1:length(.f)) {
  fi <- .f[i] 
  counter <- ab[fi] + 1L
  result[[fi]][counter] <- x[i]
  ab[fi] <- counter
  }

In the following illustration, ^ is a pointer to elements that are accessed or updated.

## i = 1

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
     ^

ab: [0] [0] [0]  ## on entry
             ^

$a: [ ] [ ]

$b: [ ] [ ] [ ] [ ]

$c: [0] [ ] [ ] [ ]
     ^

ab: [0] [0] [1]  ## on exit
             ^

## i = 2

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
         ^

ab: [0] [0] [1]  ## on entry
     ^

$a: [1] [ ]
     ^
$b: [ ] [ ] [ ] [ ]

$c: [0] [ ] [ ] [ ]

ab: [1] [0] [1]  ## on exit
     ^

## i = 3

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
             ^

ab: [1] [0] [1]  ## on entry
         ^

$a: [1] [ ]

$b: [2] [ ] [ ] [ ]
     ^
$c: [0] [ ] [ ] [ ]

ab: [1] [1] [1]  ## on exit
         ^

## i = 4

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                 ^

ab: [1] [1] [1]  ## on entry
         ^

$a: [1] [ ]

$b: [2] [3] [ ] [ ]
         ^
$c: [0] [ ] [ ] [ ]

ab: [1] [2] [1]  ## on exit
         ^

## i = 5

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                     ^

ab: [1] [2] [1]  ## on entry
             ^

$a: [1] [ ]

$b: [2] [3] [ ] [ ]

$c: [0] [4] [ ] [ ]
         ^

ab: [1] [2] [2]  ## on exit
             ^

## i = 6

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                         ^

ab: [1] [2] [2]  ## on entry
     ^

$a: [1] [5]
         ^
$b: [2] [3] [ ] [ ]

$c: [0] [4] [ ] [ ]

ab: [2] [2] [2]  ## on exit
     ^

## i = 7

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                             ^

ab: [2] [2] [2]  ## on entry
             ^

$a: [1] [5]

$b: [2] [3] [ ] [ ]

$c: [0] [4] [6] [ ]
             ^

ab: [2] [2] [3]  ## on exit
             ^

## i = 8

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                 ^

ab: [2] [2] [3]  ## on entry
             ^

$a: [1] [5]

$b: [2] [3] [ ] [ ]

$c: [0] [4] [6] [7]
                 ^

ab: [2] [2] [4]  ## on exit
             ^

## i = 9

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                     ^

ab: [2] [2] [4]  ## on entry
         ^

$a: [1] [5]

$b: [2] [3] [8] [ ]
             ^
$c: [0] [4] [6] [7]

ab: [2] [3] [4]  ## on exit
         ^

## i = 10

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                         ^

ab: [2] [3] [4]  ## on entry
         ^

$a: [1] [5]

$b: [2] [3] [8] [9]
                 ^
$c: [0] [4] [6] [7]

ab: [2] [4] [4]  ## on exit
         ^

Error while trying to borrow 2 fields from a struct wrapped in RefCell

How to extract a node attribute from XML using PHP's DOM Parser

Refreshing GUI by another thread in java (swing)

One VS2010 bug ? Allowing binding non-const reference to rvalue WITHOUT EVEN a warning?

What default promotions of types are there in the variadic arguments list?

No semicolon before [] is causing error in JavaScript

SimpleDateFormat parse(string str) doesn't throw an exception when str = 2011/12/12aaaaaaaaa?

How unique is Python's id()?

How exactly do Regular Expression word boundaries work in PHP?

Convert Python strings into floats explicitly using the comma or the point as separators

convert hex code to color name

Can I use a PDO prepared statement to bind an identifier (a table or field name) or a syntax keyword?