Returning above and below rows of specific rows in r dataframe
Consider any dataframe
col1 col2 col3 col4
row.name11 A 23 x y
row.name12 A 29 x y
row.name13 B 17 x y
row.name14 A 77 x y
I have a list of rownames which I want to return from this dataframe. Lets say I have row.name12 and row.name13 in a list. I can easily return these rows from dataframe. But I also want to return 4 rows above and 4 rows below these rows. It means I want to return from row.name8 to row.name17. I think it is similar to grep -A -B
in shell.
Probable solution- Is there any way to return row number by row name? Because if I have row number than I can easily subtract 4 and add 4 in row number and return rows.
Note: Here rownames are just examples. Rownames could be anything like RED, BLUE, BLACK, etc.
Solution 1:
Try that:
extract.with.context <- function(x, rows, after = 0, before = 0) {
match.idx <- which(rownames(x) %in% rows)
span <- seq(from = -before, to = after)
extend.idx <- c(outer(match.idx, span, `+`))
extend.idx <- Filter(function(i) i > 0 & i <= nrow(x), extend.idx)
extend.idx <- sort(unique(extend.idx))
return(x[extend.idx, , drop = FALSE])
}
dat <- data.frame(x = 1:26, row.names = letters)
extract.with.context(dat, c("a", "b", "j", "y"), after = 3, before = 1)
# x
# a 1
# b 2
# c 3
# d 4
# e 5
# i 9
# j 10
# k 11
# l 12
# m 13
# x 24
# y 25
# z 26
Solution 2:
Perhaps a combination of which()
and %in%
would help you:
dat[which(rownames(dat) %in% c("row.name13")) + c(-1, 1), ]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name14 A 77 x y
In the above, we are trying to identify which row names in "dat" are "row.name13" (using which()
), and the + c(-1, 1)
tells R to return the row before and the row after. If you wanted to include the row, you could do something like + c(-1:1)
.
To get the range of rows, switch the comma to a colon:
dat[which(rownames(dat) %in% c("row.name13")) + c(-1:1), ]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name13 B 17 x y
# row.name14 A 77 x y
Update
Matching a list is a little bit trickier, but without thinking about it too much, here is a possibility:
myRows <- c("row.name12", "row.name13")
rowRanges <- lapply(which(rownames(dat) %in% myRows), function(x) x + c(-1:1))
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 2 3 4
#
lapply(rowRanges, function(x) dat[x, ])
# [[1]]
# col1 col2 col3 col4
# row.name11 A 23 x y
# row.name12 A 29 x y
# row.name13 B 17 x y
#
# [[2]]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name13 B 17 x y
# row.name14 A 77 x y
This outputs a list
of data.frame
s which might be handy since you might have duplicated rows (as there are in this example).
Update 2: Using grep
if it is more appropriate
Here is a variation of your question, one which would be less convenient to solve using the which()
...%in%
approach.
set.seed(1)
dat1 <- data.frame(ID = 1:25, V1 = sample(100, 25, replace = TRUE))
rownames(dat1) <- paste("rowname", sample(apply(combn(LETTERS[1:4], 2),
2, paste, collapse = ""),
25, replace = TRUE),
sprintf("%02d", 1:25), sep = ".")
head(dat1)
# ID V1
# rowname.AD.01 1 27
# rowname.AB.02 2 38
# rowname.AD.03 3 58
# rowname.CD.04 4 91
# rowname.AD.05 5 21
# rowname.AD.06 6 90
Now, imagine you wanted to identify the rows with AB
and AC
, but you don't have a list of the numeric suffixes.
Here's a little function that can be used in such a scenario. It borrows a little from @Spacedman to make sure that the rows returned are within the range of the data (as per @flodel's suggestion).
getMyRows <- function(data, matches, range) {
rowMatches = lapply(unlist(lapply(matches, function(x)
grep(x, rownames(data)))), function(y) y + range)
rowMatches = lapply(rowMatches, function(x) x[x > 0 & x <= nrow(data)])
lapply(rowMatches, function(x) data[x, ])
}
You can use it as follows (but I won't print the results here). First, specify the dataset, then the pattern(s) you want matched, then the range (in this example, three rows before and four rows after).
getMyRows(dat1, c("AB", "AC"), -3:4)
Applying it to the earlier example of matching row.name12
and row.name13
, you can use it as follows: getMyRows(dat, c(12, 13), -1:1)
.
You can also modify the function to make it more general (for example, to specify matching with a column instead of row names).
Solution 3:
Create some sample data:
> dat=data.frame(col1=letters,col2=sample(26),col3=sample(letters))
> dat
col1 col2 col3
1 a 26 x
2 b 12 i
3 c 15 v
...
Set our target vector (note I choose an edge case and overlapping cases), and find matching rows:
> target=c("a","e","g","s")
> match = which(dat$col1 %in% target)
Create sequences from -2 to +2 of the matches (adjust for your needs) and merge:
> getThese = unique(as.vector(mapply(seq,match-2,match+2)))
> getThese
[1] -1 0 1 2 3 4 5 6 7 8 9 17 18 19 20 21
Fix the edge cases:
> getThese = getThese[getThese > 0 & getThese <= nrow(dat)]
> dat[getThese,]
col1 col2 col3
1 a 26 x
2 b 12 i
3 c 15 v
4 d 22 d
5 e 2 j
6 f 9 l
7 g 1 w
8 h 21 n
9 i 17 p
17 q 18 a
18 r 10 m
19 s 24 o
20 t 13 e
21 u 3 k
>
Remember our targets were a, e, g and s. You've now got those plus two rows above and two rows below for each, with no duplicates.
If you are using row names, just create 'match' from those. I was using a column.
I'd write a bunch more tests using the testthat package if this were my problem.
Solution 4:
Another option will be to use filter
. In case stats::filter
is masked e.g. by dplyr::filter
you have to use stats::filter
.
dat <- data.frame(x = seq_along(letters), row.names = letters)
i <- rownames(dat) %in% c("a", "b", "j", "y") #Get the matches
nAfter <- 3
nBefore <- 1
fi <- seq(-nBefore, nAfter)
n <- max(abs(x))
fi <- seq(-n, n) %in% fi
dat[head(tail(filter(c(rep(FALSE, n), i, rep(FALSE, n)), fi), -n), -n) > 0,, drop = FALSE]
# x
#a 1
#b 2
#c 3
#d 4
#e 5
#i 9
#j 10
#k 11
#l 12
#m 13
#x 24
#y 25
#z 26