How do I delete rows in a data frame?
The key idea is you form a set of the rows you want to remove, and keep the complement of that set.
In R, the complement of a set is given by the '-' operator.
So, assuming the data.frame
is called myData
:
myData[-c(2, 4, 6), ] # notice the -
Of course, don't forget to "reassign" myData
if you wanted to drop those rows entirely---otherwise, R just prints the results.
myData <- myData[-c(2, 4, 6), ]
You can also work with a so called boolean vector, aka logical
:
row_to_keep = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
myData = myData[row_to_keep,]
Note that the !
operator acts as a NOT, i.e. !TRUE == FALSE
:
myData = myData[!row_to_keep,]
This seems a bit cumbersome in comparison to @mrwab's answer (+1 btw :)), but a logical vector can be generated on the fly, e.g. where a column value exceeds a certain value:
myData = myData[myData$A > 4,]
myData = myData[!myData$A > 4,] # equal to myData[myData$A <= 4,]
You can transform a boolean vector to a vector of indices:
row_to_keep = which(myData$A > 4)
Finally, a very neat trick is that you can use this kind of subsetting not only for extraction, but also for assignment:
myData$A[myData$A > 4,] <- NA
where column A
is assigned NA
(not a number) where A
exceeds 4.
Problems with deleting by row number
For quick and dirty analyses, you can delete rows of a data.frame by number as per the top answer. I.e.,
newdata <- myData[-c(2, 4, 6), ]
However, if you are trying to write a robust data analysis script, you should generally avoid deleting rows by numeric position. This is because the order of the rows in your data may change in the future. A general principle of a data.frame or database tables is that the order of the rows should not matter. If the order does matter, this should be encoded in an actual variable in the data.frame.
For example, imagine you imported a dataset and deleted rows by numeric position after inspecting the data and identifying the row numbers of the rows that you wanted to delete. However, at some later point, you go into the raw data and have a look around and reorder the data. Your row deletion code will now delete the wrong rows, and worse, you are unlikely to get any errors warning you that this has occurred.
Better strategy
A better strategy is to delete rows based on substantive and stable properties of the row. For example, if you had an id
column variable that uniquely identifies each case, you could use that.
newdata <- myData[ !(myData$id %in% c(2,4,6)), ]
Other times, you will have a formal exclusion criteria that could be specified, and you could use one of the many subsetting tools in R to exclude cases based on that rule.
Create id column in your data frame or use any column name to identify the row. Using index is not fair to delete.
Use subset
function to create new frame.
updated_myData <- subset(myData, id!= 6)
print (updated_myData)
updated_myData <- subset(myData, id %in% c(1, 3, 5, 7))
print (updated_myData)
By simplified sequence :
mydata[-(1:3 * 2), ]
By sequence :
mydata[seq(1, nrow(mydata), by = 2) , ]
By negative sequence :
mydata[-seq(2, nrow(mydata), by = 2) , ]
Or if you want to subset by selecting odd numbers:
mydata[which(1:nrow(mydata) %% 2 == 1) , ]
Or if you want to subset by selecting odd numbers, version 2:
mydata[which(1:nrow(mydata) %% 2 != 0) , ]
Or if you want to subset by filtering even numbers out:
mydata[!which(1:nrow(mydata) %% 2 == 0) , ]
Or if you want to subset by filtering even numbers out, version 2:
mydata[!which(1:nrow(mydata) %% 2 != 1) , ]