A sensible function to sort dataframes

I am creating a function to sort data.frames (Why? Because of reasons). Some of the criteria:

Works on data.frames
Aimed at non-interactive use
Uses base R only
No dependency on any non-base package

The function looks like this now:

#' @title Sort a data.frame
#' @description Sort a data.frame based on one or more columns
#' @param x A data.frame class object
#' @param by A column in the data.frame. Defaults to NULL, which sorts by all columns.
#' @param decreasing A logical indicating the direction of sorting.
#' @return A data.frame.
#' 
sortdf <- function(x,by=NULL,decreasing=FALSE) {
  if(!is.data.frame(x)) stop("Input is not a data.frame.")

  if(is.null(by)) {
    ord <- do.call(order,x)
  } else {
    if(any(!by %in% colnames(x))) stop("One or more items in 'by' was not found.")
    if(length(by) == 1) ord <- order(x[ , by])
    if(length(by) > 1) ord <- do.call(order, x[ , by])
  }
  
  if(decreasing) ord <- rev(ord)
  return(x[ord, , drop=FALSE])
}

Examples

sortdf(iris)
sortdf(iris,"Petal.Length")
sortdf(iris,"Petal.Length",decreasing=TRUE)
sortdf(iris,c("Petal.Length","Sepal.Length"))
sortdf(iris,"Petal.Length",decreasing=TRUE)

What works so far

Sort data.frame by one or more columns
Adjust overall direction of sort

But, I need one more feature: The ability to set sorting direction for each column separately by passing a vector of directions for each column specified in by. For example;

sortdf(iris,by=c("Sepal.Width","Petal.Width"),dir=c("up","down"))

Any ideas/suggestions on how to implement this?

Update

Benchmark of answers below:

library(microbenchmark)
library(ggplot2)

m <- microbenchmark::microbenchmark(
  "base 1u"=iris[order(iris$Petal.Length),],
  "Maël 1u"=sortdf(iris,"Petal.Length"),
  "Mikko 1u"=sortdf1(iris,"Petal.Length"),
  "arrange 1u"=dplyr::arrange(iris,Petal.Length),
  "base 1d"=iris[order(iris$Petal.Length,decreasing=TRUE),],
  "Maël 1d"=sortdf(iris,"Petal.Length",dir="down"),
  "Mikko 1d"=sortdf1(iris,"Petal.Length",decreasing=T),
  "arrange 1d"=dplyr::arrange(iris,-Petal.Length),
  "base 2d"=iris[order(iris$Petal.Length,iris$Sepal.Length,decreasing=TRUE),],
  "Maël 2d"=sortdf(iris,c("Petal.Length","Sepal.Length"),dir=c("down","down")),
  "Mikko 2d"=sortdf1(iris,c("Petal.Length","Sepal.Length"),decreasing=T),
  "arrange 2d"=dplyr::arrange(iris,-Petal.Length,-Sepal.Length),
  "base 1u1d"=iris[order(iris$Petal.Length,rev(iris$Sepal.Length)),],
  "Maël 1u1d"=sortdf(iris,c("Petal.Length","Sepal.Length"),dir=c("up","down")),
  "Mikko 1u1d"=sortdf1(iris,c("Petal.Length","Sepal.Length"),decreasing=c(T,F)),
  "arrange 1u1d"=dplyr::arrange(iris,Petal.Length,-Sepal.Length),
  times=1000
)
autoplot(m)+theme_bw()

enter image description here

R 4.1.0
dplyr 1.0.7

Here's my attempt, using a function taken from this answer, and assuming up is ascending, and down is descending. dir is set to "up" by default.

sortdf <- function(x, by=NULL, dir=NULL) {
  if(!is.data.frame(x)) stop("Input is not a data.frame.")
  
  if(is.null(by) & is.null(dir)) {
    dir <- rep("up", ncol(x))
  } else if (is.null(dir)) {
    dir <- rep("up", length(by))
  }
  
  sort_asc = by[which(dir == "up")]
  sort_desc = by[which(dir == "down")]

  if(is.null(by)) {
    ord <- do.call(order,x)
  } else {
    if(any(!by %in% colnames(x))) stop("One or more items in 'by' was not found.")
    if(length(by) == 1) ord <- order(x[ , by])
    if(length(by) > 1) ord <- do.call(order, c(as.list(iris[sort_asc]), lapply(iris[sort_desc], 
                                                                               function(x) -xtfrm(x))))
  }
  
   if(length(dir) == 1 & all(dir == "down")) ord <- rev(ord)

  x[ord, , drop=FALSE]
}

You can then have multiple different directions to sort:

sortdf(iris, by=c("Sepal.Width","Petal.Width"), dir=c("up","down")) |>
  head()

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
61           5.0         2.0          3.5         1.0 versicolor
69           6.2         2.2          4.5         1.5 versicolor
120          6.0         2.2          5.0         1.5  virginica
63           6.0         2.2          4.0         1.0 versicolor
54           5.5         2.3          4.0         1.3 versicolor
88           6.3         2.3          4.4         1.3 versicolor

And other examples work as intended as well:

sortdf(iris)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
14          4.3         3.0          1.1         0.1  setosa
9           4.4         2.9          1.4         0.2  setosa
39          4.4         3.0          1.3         0.2  setosa
43          4.4         3.2          1.3         0.2  setosa
42          4.5         2.3          1.3         0.3  setosa
4           4.6         3.1          1.5         0.2  setosa

sortdf(iris, c("Petal.Length","Sepal.Length"))
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
23          4.6         3.6          1.0         0.2  setosa
14          4.3         3.0          1.1         0.1  setosa
36          5.0         3.2          1.2         0.2  setosa
15          5.8         4.0          1.2         0.2  setosa
39          4.4         3.0          1.3         0.2  setosa
43          4.4         3.2          1.3         0.2  setosa

sortdf(iris, "Petal.Length", "down")
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
119          7.7         2.6          6.9         2.3 virginica
123          7.7         2.8          6.7         2.0 virginica
118          7.7         3.8          6.7         2.2 virginica
106          7.6         3.0          6.6         2.1 virginica
132          7.9         3.8          6.4         2.0 virginica
108          7.3         2.9          6.3         1.8 virginica

Here’s another alternative that gets rid of all the branching logic by ensuring you always find a proxy to sort by for each by column with xtfrm(). For consistency with base, instead of using a “new” dir argument, it might also be preferable to keep the decreasing argument, but just allow it to be a vector that’s recycled to match the by length.

sortdf <- function(x, by = colnames(x), decreasing = FALSE, ...) {
  if (!is.data.frame(x)) {
    stop("Input is not a data.frame.")
  }

  if (!all(by %in% colnames(x))) {
    stop("One or more items in 'by' was not found.")
  }
  
  # Recycle `decreasing` to ensure it matches `by`
  decreasing <- rep_len(as.logical(decreasing), length(by))
  
  # Find a sorting proxy for each `by` column, according to `decreasing`
  pxy <- Map(function(x, decr) (-1)^decr * xtfrm(x), x[by], decreasing)
  ord <- do.call(order, c(pxy, list(...)))
  
  x[ord, , drop = FALSE]
}

Thinking about this a bit more, I might even simplify this further and:

Let Map() handle the recycling for by and decreasing.
Let [ handle throwing errors for incorrect indexing (and consequently also accept numeric indices for columns rather than just names).
Not pass ... (following the YAGNI principle).

This could then come down to two one-liner functions:

sortdf <- function(x, by = colnames(x), decreasing = FALSE) {
  x[do.call(order, Map(sortproxy, x[by], decreasing)), , drop = FALSE]
}

sortproxy <- function(x, decreasing = FALSE) {
  as.integer((-1)^as.logical(decreasing)) * xtfrm(x)
}

Examples:

sortdf(iris) |> head()
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 14          4.3         3.0          1.1         0.1  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 39          4.4         3.0          1.3         0.2  setosa
#> 43          4.4         3.2          1.3         0.2  setosa
#> 42          4.5         2.3          1.3         0.3  setosa
#> 4           4.6         3.1          1.5         0.2  setosa

sortdf(iris, by = c("Sepal.Length", "Sepal.Width")) |> head()
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 14          4.3         3.0          1.1         0.1  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 39          4.4         3.0          1.3         0.2  setosa
#> 43          4.4         3.2          1.3         0.2  setosa
#> 42          4.5         2.3          1.3         0.3  setosa
#> 4           4.6         3.1          1.5         0.2  setosa

sortdf(iris, by = c("Sepal.Length", "Sepal.Width"), decreasing = TRUE) |> head()
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 132          7.9         3.8          6.4         2.0 virginica
#> 118          7.7         3.8          6.7         2.2 virginica
#> 136          7.7         3.0          6.1         2.3 virginica
#> 123          7.7         2.8          6.7         2.0 virginica
#> 119          7.7         2.6          6.9         2.3 virginica
#> 106          7.6         3.0          6.6         2.1 virginica

sortdf(iris, by = c("Sepal.Length", "Sepal.Width"), decreasing = c(TRUE, FALSE)) |> head()
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 132          7.9         3.8          6.4         2.0 virginica
#> 119          7.7         2.6          6.9         2.3 virginica
#> 123          7.7         2.8          6.7         2.0 virginica
#> 136          7.7         3.0          6.1         2.3 virginica
#> 118          7.7         3.8          6.7         2.2 virginica
#> 106          7.6         3.0          6.6         2.1 virginica

A sensible function to sort dataframes

Related

Recent Posts