Find duplicate values in R
I have a table with 21638 unique* rows:
vocabulary <- read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Vocabulary.txt", header = TRUE)
This table has five columns, the first of which holds the respondent ID numbers. I want to check if any respondents appear twice, or if all respondents are unique.
To count unique IDs I can use
length(unique(vocabulary$id))
and to check if there are any duplicates I might do
length(unique(vocabulary$id)) == nrow(vocabulary)
which returns TRUE if there are no duplicates (which there aren't).
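If only a yes/no answer is needed, base R's anyDuplicated() may be a more direct check than comparing lengths: it returns the index of the first duplicated element, or 0 if there is none. A minimal sketch on a made-up vector (the values in `ids` are hypothetical, not taken from the survey):

```r
# Hypothetical IDs; 102 appears a second time at position 4
ids <- c(101, 102, 103, 102, 104)

# Index of the first element that repeats an earlier one (0 if none)
first_dup <- anyDuplicated(ids)

# TRUE when at least one duplicate exists
has_dups <- first_dup > 0
```

Note that this only reports the first repeat, so it does not by itself list every duplicate.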
My question:
Is there a direct way to return the values or line numbers of duplicates?
Some further explanation:
There is an interpretation problem with the function duplicated(), because it only returns the duplicates in the strict sense, excluding the "originals". For example, sum(duplicated(vocabulary$id)) or dim(vocabulary[duplicated(vocabulary$id),])[1] might return "5" as the number of duplicate rows. The problem is that if you only know the number of duplicates, you won't know how many rows they duplicate. Does "5" mean that there are five rows with one duplicate each, or that there is one row with five duplicates? And since you won't have the IDs or line numbers of the duplicates, you have no means of finding the "originals".
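To make the ambiguity concrete, here is a small sketch with two made-up vectors (both hypothetical, not from the survey) that yield the same count even though their duplicate structure is quite different:

```r
# Five values, each appearing twice
five_pairs <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)

# One value appearing six times (one "original" plus five repeats)
one_sextet <- rep(9, 6)

# duplicated() flags every occurrence after the first, so both counts are 5
n_pairs  <- sum(duplicated(five_pairs))
n_sextet <- sum(duplicated(one_sextet))
```

The count alone cannot distinguish the two cases.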
*I know there are no duplicate IDs in this survey, but it is a good example, because any of the answers given elsewhere to this question, like duplicated(vocabulary$id) or table(vocabulary$id), will output a haystack to your screen in which you'll be quite unable to find any rare duplicate needles.
Solution 1:
You could use table, i.e.
n_occur <- data.frame(table(vocabulary$id))
gives you a data frame with a list of ids and the number of times they occurred.
n_occur[n_occur$Freq > 1,]
tells you which ids occurred more than once.
vocabulary[vocabulary$id %in% n_occur$Var1[n_occur$Freq > 1],]
returns the records with more than one occurrence.
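The same steps can be tried without downloading anything. The sketch below uses a small made-up data frame (`toy`, with hypothetical `id` and `score` columns) in place of vocabulary, and adds which() to get the row numbers the question asks about:

```r
# Hypothetical stand-in for `vocabulary`: id 102 occurs twice, 104 three times
toy <- data.frame(id    = c(101, 102, 102, 103, 104, 104, 104),
                  score = c(3, 5, 5, 7, 2, 2, 2))

# Frequency table as a data frame with columns Var1 (the id) and Freq
n_occur <- data.frame(table(toy$id))

# ids that occur more than once
repeated <- n_occur[n_occur$Freq > 1, ]

# All records belonging to a repeated id, "originals" included
dup_rows <- toy[toy$id %in% repeated$Var1, ]

# Row numbers of those records
dup_row_numbers <- which(toy$id %in% repeated$Var1)
```

Var1 is a factor here; %in% compares it with the numeric ids via their character representation, which is why the lookup works without an explicit conversion.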
Solution 2:
This will give you duplicate rows:
vocabulary[duplicated(vocabulary$id),]
This will give you the number of duplicates:
dim(vocabulary[duplicated(vocabulary$id),])[1]
Example:
vocabulary2 <- rbind(vocabulary, vocabulary[1, ]) # creates a duplicate at the end
vocabulary2[duplicated(vocabulary2$id),]
# id year sex education vocabulary
#21639 20040001 2004 Female 9 3
dim(vocabulary2[duplicated(vocabulary2$id),])[1]
#[1] 1 #=1 duplicate
EDIT
OK, with the additional information, here's what you should do: duplicated has a fromLast option which allows you to get duplicates from the end. If you combine this with the normal duplicated, you get all duplicates. The following example adds duplicates to the original vocabulary object (line 1 is duplicated twice and line 5 is duplicated once). I then use table to get the total number of duplicates per ID.
#Create vocabulary object with duplicates
voc.dups <- rbind(vocabulary, vocabulary[1, ], vocabulary[1, ], vocabulary[5, ])
#List duplicates
dups <- voc.dups[duplicated(voc.dups$id) | duplicated(voc.dups$id, fromLast = TRUE), ]
dups
# id year sex education vocabulary
#1 20040001 2004 Female 9 3
#5 20040008 2004 Male 14 1
#21639 20040001 2004 Female 9 3
#21640 20040001 2004 Female 9 3
#51000 20040008 2004 Male 14 1
#Count duplicates by id
table(dups$id)
#20040001 20040008
# 3 2
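If line numbers are wanted rather than the rows themselves, the same logical mask can be passed to which(). A minimal sketch on a made-up vector (the values in `id` are hypothetical), which also groups the positions by ID with split() so each "original" sits next to its repeats:

```r
# Hypothetical IDs: 11 occurs at positions 1, 4, 6 and 13 at positions 3, 7
id <- c(11, 12, 13, 11, 14, 11, 13)

# TRUE for every element whose value occurs more than once
is_dup <- duplicated(id) | duplicated(id, fromLast = TRUE)

# Positions (line numbers) of all such elements
dup_positions <- which(is_dup)

# Positions grouped by the duplicated value
positions_by_id <- split(dup_positions, id[is_dup])
```

The length of each group gives the total number of occurrences per ID, matching what table(dups$id) reports.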
Solution 3:
Here are a few approaches to your question. Note that they do not all return the same result, so be careful:
# First assign your "id"s to an R object.
# Here's a hypothetical example:
id <- c("a","b","b","c","c","c","d","d","d","d")
#To return ALL MINUS ONE duplicated values:
id[duplicated(id)]
## [1] "b" "c" "c" "d" "d" "d"
#To return ALL duplicated values by specifying fromLast argument:
id[duplicated(id) | duplicated(id, fromLast=TRUE)]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"
#Yet another way to return ALL duplicated values, using %in% operator:
id[ id %in% id[duplicated(id)] ]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"
Hope these help. Good luck.