Categorize continuous variable with dplyr [duplicate]
I want to create a new variable with 3 arbitrary categories based on continuous data.
set.seed(123)
df <- data.frame(a = rnorm(100))
Using base R, I would do:
df$category[df$a < 0.5] <- "low"
df$category[df$a > 0.5 & df$a < 0.6] <- "middle"
df$category[df$a > 0.6] <- "high"
Is there a dplyr solution for this, presumably using mutate()?
Furthermore, is there a way to calculate the categories rather than choosing them? That is, can R work out where the breaks between the categories should be?
EDIT
The answer is in this thread; however, it does not involve labelling, which confused me (and may confuse others), so I think this question still serves a purpose.
To convert from numeric to categorical, use cut. In your particular case, you want:
df$category <- cut(df$a,
                   breaks = c(-Inf, 0.5, 0.6, Inf),
                   labels = c("low", "middle", "high"))
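One detail worth knowing is which category a value lands in when it sits exactly on a break. The sketch below uses a small made-up vector x (not part of the question's data) to show that cut() closes intervals on the right by default, and that right = FALSE flips this:

```r
# Hypothetical test values, chosen to sit on and around the breaks
x <- c(0.49, 0.5, 0.55, 0.6, 0.61)
brks <- c(-Inf, 0.5, 0.6, Inf)
labs <- c("low", "middle", "high")

# Default (right = TRUE): intervals are (-Inf, 0.5], (0.5, 0.6], (0.6, Inf]
right_closed <- cut(x, breaks = brks, labels = labs)

# right = FALSE: intervals are [-Inf, 0.5), [0.5, 0.6), [0.6, Inf)
left_closed <- cut(x, breaks = brks, labels = labs, right = FALSE)

data.frame(x, right_closed, left_closed)
```

With the default, 0.5 is labelled "low" and 0.6 "middle"; with right = FALSE they become "middle" and "high".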
Or, using dplyr:
library(dplyr)
res <- df %>% mutate(category=cut(a, breaks=c(-Inf, 0.5, 0.6, Inf), labels=c("low","middle","high")))
## a category
##1 -0.560475647 low
##2 -0.230177489 low
##3 1.558708314 high
##4 0.070508391 low
##5 0.129287735 low
## ...
##35 0.821581082 high
##36 0.688640254 high
##37 0.553917654 middle
##38 -0.061911711 low
##39 -0.305962664 low
##40 -0.380471001 low
## ...
##96 -0.600259587 low
##97 2.187332993 high
##98 1.532610626 high
##99 -0.235700359 low
##100 -1.026420900 low
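If you would rather spell the conditions out, much like the base R version in the question, dplyr's case_when() is another option. This is only a sketch: the name res2 is made up here, and a value exactly equal to 0.5 or 0.6 goes to the upper category, which differs from cut()'s default.

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(100))

res2 <- df %>%
  mutate(
    # Conditions are checked in order; the first one that is TRUE wins
    category = case_when(
      a < 0.5 ~ "low",
      a < 0.6 ~ "middle",
      TRUE    ~ "high"
    ),
    # case_when() returns character here; convert to an ordered set of levels
    category = factor(category, levels = c("low", "middle", "high"))
  )

head(res2)
```

For many categories cut() stays shorter, but case_when() can be easier to read when the conditions are not simple consecutive intervals.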
Alternatively, let R compute the breaks by passing quantiles to cut:
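# Tercile break points: minimum, 1/3 and 2/3 quantiles, maximum
xs <- quantile(df$a, c(0, 1/3, 2/3, 1))
# Nudge the lowest break down so the minimum falls inside the first interval
xs[1] <- xs[1] - 0.00005
df1 <- df %>% mutate(category = cut(a, breaks = xs, labels = c("low", "middle", "high")))
boxplot(df1$a ~ df1$category, col = 3:5)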
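As a rough sketch of two other ways to let R pick the breaks, using base R only: passing a single number to breaks gives equal-width bins, and include.lowest = TRUE keeps the minimum in the first quantile bin without adjusting the lowest break by hand. The column names by_width and by_count are made up for this example.

```r
set.seed(123)
df <- data.frame(a = rnorm(100))
labs <- c("low", "middle", "high")

# Equal-width bins: the range of a is split into 3 intervals of equal length
df$by_width <- cut(df$a, breaks = 3, labels = labs)

# Roughly equal-count bins: breaks at the 1/3 and 2/3 quantiles;
# include.lowest = TRUE makes the first interval include the minimum
qs <- quantile(df$a, probs = c(0, 1/3, 2/3, 1))
df$by_count <- cut(df$a, breaks = qs, labels = labs, include.lowest = TRUE)

# Compare how many observations each approach puts in each category
table(df$by_width)
table(df$by_count)
```

Equal-width bins can be very uneven in size for skewed data, while quantile bins fix the group sizes and let the break positions move, so the better choice depends on what the categories are for.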