Returning above and below rows of specific rows in r dataframe

Consider any dataframe

            col1   col2    col3   col4
row.name11    A     23      x       y
row.name12    A     29      x       y
row.name13    B     17      x       y
row.name14    A     77      x       y

I have a list of rownames which I want to return from this dataframe. Lets say I have row.name12 and row.name13 in a list. I can easily return these rows from dataframe. But I also want to return 4 rows above and 4 rows below these rows. It means I want to return from row.name8 to row.name17. I think it is similar to grep -A -B in shell.

Probable solution- Is there any way to return row number by row name? Because if I have row number than I can easily subtract 4 and add 4 in row number and return rows.

Note: Here rownames are just examples. Rownames could be anything like RED, BLUE, BLACK, etc.


Solution 1:

Try that:

extract.with.context <- function(x, rows, after = 0, before = 0) {

  match.idx  <- which(rownames(x) %in% rows)
  span       <- seq(from = -before, to = after)
  extend.idx <- c(outer(match.idx, span, `+`))
  extend.idx <- Filter(function(i) i > 0 & i <= nrow(x), extend.idx)
  extend.idx <- sort(unique(extend.idx))

  return(x[extend.idx, , drop = FALSE])
}

dat <- data.frame(x = 1:26, row.names = letters)
extract.with.context(dat, c("a", "b", "j", "y"), after = 3, before = 1)
#    x
# a  1
# b  2
# c  3
# d  4
# e  5
# i  9
# j 10
# k 11
# l 12
# m 13
# x 24
# y 25
# z 26

Solution 2:

Perhaps a combination of which() and %in% would help you:

dat[which(rownames(dat) %in% c("row.name13")) + c(-1, 1), ]
#            col1 col2 col3 col4
# row.name12    A   29    x    y
# row.name14    A   77    x    y

In the above, we are trying to identify which row names in "dat" are "row.name13" (using which()), and the + c(-1, 1) tells R to return the row before and the row after. If you wanted to include the row, you could do something like + c(-1:1).

To get the range of rows, switch the comma to a colon:

dat[which(rownames(dat) %in% c("row.name13")) + c(-1:1), ]
#            col1 col2 col3 col4
# row.name12    A   29    x    y
# row.name13    B   17    x    y
# row.name14    A   77    x    y

Update

Matching a list is a little bit trickier, but without thinking about it too much, here is a possibility:

myRows <- c("row.name12", "row.name13")
rowRanges <- lapply(which(rownames(dat) %in% myRows), function(x) x + c(-1:1))
# [[1]]
# [1] 1 2 3
# 
# [[2]]
# [1] 2 3 4
#
lapply(rowRanges, function(x) dat[x, ])
# [[1]]
#            col1 col2 col3 col4
# row.name11    A   23    x    y
# row.name12    A   29    x    y
# row.name13    B   17    x    y
# 
# [[2]]
#            col1 col2 col3 col4
# row.name12    A   29    x    y
# row.name13    B   17    x    y
# row.name14    A   77    x    y

This outputs a list of data.frames which might be handy since you might have duplicated rows (as there are in this example).

Update 2: Using grep if it is more appropriate

Here is a variation of your question, one which would be less convenient to solve using the which()...%in% approach.

set.seed(1)
dat1 <- data.frame(ID = 1:25, V1 = sample(100, 25, replace = TRUE))
rownames(dat1) <- paste("rowname", sample(apply(combn(LETTERS[1:4], 2), 
                                               2, paste, collapse = ""), 
                                         25, replace = TRUE), 
                       sprintf("%02d", 1:25), sep = ".")
head(dat1)
#               ID V1
# rowname.AD.01  1 27
# rowname.AB.02  2 38
# rowname.AD.03  3 58
# rowname.CD.04  4 91
# rowname.AD.05  5 21
# rowname.AD.06  6 90

Now, imagine you wanted to identify the rows with AB and AC, but you don't have a list of the numeric suffixes.

Here's a little function that can be used in such a scenario. It borrows a little from @Spacedman to make sure that the rows returned are within the range of the data (as per @flodel's suggestion).

getMyRows <- function(data, matches, range) {
  rowMatches = lapply(unlist(lapply(matches, function(x)
    grep(x, rownames(data)))), function(y) y + range)
  rowMatches = lapply(rowMatches, function(x) x[x > 0 & x <= nrow(data)])
  lapply(rowMatches, function(x) data[x, ])
}

You can use it as follows (but I won't print the results here). First, specify the dataset, then the pattern(s) you want matched, then the range (in this example, three rows before and four rows after).

getMyRows(dat1, c("AB", "AC"), -3:4)

Applying it to the earlier example of matching row.name12 and row.name13, you can use it as follows: getMyRows(dat, c(12, 13), -1:1).

You can also modify the function to make it more general (for example, to specify matching with a column instead of row names).

Solution 3:

Create some sample data:

> dat=data.frame(col1=letters,col2=sample(26),col3=sample(letters))
> dat
   col1 col2 col3
1     a   26    x
2     b   12    i
3     c   15    v
...

Set our target vector (note I choose an edge case and overlapping cases), and find matching rows:

> target=c("a","e","g","s")
> match = which(dat$col1 %in% target)

Create sequences from -2 to +2 of the matches (adjust for your needs) and merge:

> getThese = unique(as.vector(mapply(seq,match-2,match+2)))
> getThese
 [1] -1  0  1  2  3  4  5  6  7  8  9 17 18 19 20 21

Fix the edge cases:

> getThese = getThese[getThese > 0 & getThese <= nrow(dat)]
> dat[getThese,]
   col1 col2 col3
1     a   26    x
2     b   12    i
3     c   15    v
4     d   22    d
5     e    2    j
6     f    9    l
7     g    1    w
8     h   21    n
9     i   17    p
17    q   18    a
18    r   10    m
19    s   24    o
20    t   13    e
21    u    3    k
> 

Remember our targets were a, e, g and s. You've now got those plus two rows above and two rows below for each, with no duplicates.

If you are using row names, just create 'match' from those. I was using a column.

I'd write a bunch more tests using the testthat package if this were my problem.

Solution 4:

Another option will be to use filter. In case stats::filter is masked e.g. by dplyr::filter you have to use stats::filter.

dat <- data.frame(x = seq_along(letters), row.names = letters)

i <- rownames(dat) %in% c("a", "b", "j", "y") #Get the matches

nAfter <- 3
nBefore <- 1

fi <- seq(-nBefore, nAfter)
n <- max(abs(x))
fi <- seq(-n, n) %in% fi

dat[head(tail(filter(c(rep(FALSE, n), i, rep(FALSE, n)), fi), -n), -n) > 0,, drop = FALSE]
#   x
#a  1
#b  2
#c  3
#d  4
#e  5
#i  9
#j 10
#k 11
#l 12
#m 13
#x 24
#y 25
#z 26