distinct rows in R based on the order of other columns
I conducted a multi-period online experiment but there was partially duplicated and incomplete data.
Simply put, there are 2 trials (trial 1:2) in the online experiment and each trial contains 2 periods (period 1:2). Participants make a decision (1:5) to guess an unchanged nature (1:5) in 2 periods of a trial. After a trial, a nature changes randomly.
I found participants might get stuck in a period and have to redo the experiment, which results in duplicated and incomplete trials in my data.
For example:
id | decision | nature | period | trial |
---|---|---|---|---|
1000 | 1 | 5 | 1 | 1 |
1000 | 1 | 5 | 2 | 1 |
1000 | 1 | 5 | 1 | 2 |
1000 | 1 | 5 | 2 | 2 |
1000 | 1 | 5 | 1 | 3 |
1000 | 2 | 2 | 1 | 1 |
1000 | 3 | 2 | 2 | 1 |
1000 | 1 | 2 | 1 | 2 |
1000 | 3 | 2 | 2 | 2 |
1000 | 5 | 2 | 1 | 3 |
1000 | 1 | 2 | 2 | 3 |
As you can see, in the first attempt, trial 3 was incomplete because this participant was stuck and had to redo the experiment, which results in duplicated data.
I ran a distinct function r code: distinct(id, trial,period,.keep_all = TRUE), but I got this
id | decision | nature | period | trial |
---|---|---|---|---|
1000 | 1 | 5 | 1 | 1 |
1000 | 1 | 5 | 2 | 1 |
1000 | 1 | 5 | 1 | 2 |
1000 | 1 | 5 | 2 | 2 |
1000 | 1 | 5 | 1 | 3 |
1000 | 5 | 2 | 1 | 3 |
The different nature values in trial 3 indicates that the Distinct function mixed two different attempts of this participant. How can I use distinct or other functions in R to get the completed data of participants in the same attempt?
My desired output is to keep one set of complete trials (1:3) per participant, where the nature value is consistent in a trial, and eliminate all duplicated and incomplete trials.
Thanks in advance!
Solution 1:
Is this what you are looking for?
## data
data <- structure(list(id = c(1000L, 1000L, 1000L, 1000L, 1000L, 1000L,
1000L, 1000L, 1000L, 1000L, 1000L), decision = c(1L, 1L, 1L,
1L, 1L, 2L, 3L, 1L, 3L, 5L, 1L), nature = c(5L, 5L, 5L, 5L, 5L,
2L, 2L, 2L, 2L, 2L, 2L), period = c(1L, 2L, 1L, 2L, 1L, 1L, 2L,
1L, 2L, 1L, 2L), trial = c(1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 2L,
3L, 3L)), row.names = c(NA, -11L), class = "data.frame")
library(dplyr)
data %>%
mutate(rownum = 1:n()) %>%
group_by(id, trial, period) %>%
mutate(maxrownum = max(rownum)) %>%
filter(rownum == maxrownum) %>%
select(-c(rownum, maxrownum))
I have created an identifier for the row number. Assuming that your data is ordered by attempt, choosing the rows where the row number is equal to max(row number) picks up the last attempt for each (id, trial, period) triple.
Output:
# Groups: id, trial, period [6]
id decision nature period trial
<int> <int> <int> <int> <int>
1 1000 2 2 1 1
2 1000 3 2 2 1
3 1000 1 2 1 2
4 1000 3 2 2 2
5 1000 5 2 1 3
6 1000 1 2 2 3