How to calculate tertiles in R?
I have a dataset that is not normally distributed and contains a lot of 0 values. I now want to calculate the tertiles for each column.
df <- tribble(
~shop, ~products, ~sales,
'A', 300, 100,
'B', 0, 0,
'C', 10, 2000,
'D', 0, 0,
'E', 0, 0,
'F', 0, 0,
'G', 20, 10,
'H', 0, 0,
'J', 700, 50,
'K', 0, 0,
)
Thanks to @AlexB 's answer to that question, I've tried calculating the tertiles with the following code:
df %>%
arrange(products) %>%
mutate(tertiles = ntile(products, 3)) %>%
mutate(tertiles = if_else(tertiles == 1, 'Low', if_else(tertiles == 2, 'Medium', 'High')))
However, even though the values are 0, the output is turning to 'high'. How can I calculate it more accurately?
I think what you're looking for could be achieved by using cut
instead of ntile
. Use the breaks
argument in cut
to define the limits for the three labels, and the labels
argument to specify the labels themselves.
df %>%
arrange(products) %>%
mutate(tertile = cut(products,
breaks = c(-1, 1, 100, Inf),
labels = c("low", "medium", "high")))
#> # A tibble: 10 x 4
#> shop products sales tertile
#> <chr> <dbl> <dbl> <fct>
#> 1 B 0 0 low
#> 2 D 0 0 low
#> 3 E 0 0 low
#> 4 F 0 0 low
#> 5 H 0 0 low
#> 6 K 0 0 low
#> 7 C 10 2000 medium
#> 8 G 20 10 medium
#> 9 A 300 100 high
#> 10 J 700 50 high
Addendum
To apply the same method to each column, we can do:
f <- function(x) cut(x, c(-1, 1, 100, Inf), c("low", "medium", "high"))
df %>%
arrange(products) %>%
mutate(across(c("products", "sales"), .fns = f, .names = "{.col}_tertile"))
#> # A tibble: 10 x 5
#> shop products sales products_tertile sales_tertile
#> <chr> <dbl> <dbl> <fct> <fct>
#> 1 B 0 0 low low
#> 2 D 0 0 low low
#> 3 E 0 0 low low
#> 4 F 0 0 low low
#> 5 H 0 0 low low
#> 6 K 0 0 low low
#> 7 C 10 2000 medium high
#> 8 G 20 10 medium medium
#> 9 A 300 100 high medium
#> 10 J 700 50 high medium
Created on 2022-01-23 by the reprex package (v2.0.1)
Update:
Somehow it is obvious that I was the first with the correct solution. But Allan Cameron finished it perfectly. So this is okay as I learned many things from Allan Cameron:
To give my final solution:
df %>%
mutate(across(c(products, sales), ~cut(., breaks = 3, labels = c("low", "medium", "high")), .names = "tertile_{.col}"))
shop products sales tertile_products tertile_sales
<chr> <dbl> <dbl> <fct> <fct>
1 A 300 100 medium low
2 B 0 0 low low
3 C 10 2000 low high
4 D 0 0 low low
5 E 0 0 low low
6 F 0 0 low low
7 G 20 10 low low
8 H 0 0 low low
9 J 700 50 high low
10 K 0 0 low low
First answer:
For column products
:
df %>%
arrange(products) %>%
mutate(tertiles = cut(products, breaks = 3, labels = c(1:3))) %>%
mutate(tertiles = case_when(tertiles==1 ~ "Low",
tertiles==2 ~ "Medium",
tertiles==3 ~ "High",
TRUE ~NA_character_))
shop products sales tertiles
<chr> <dbl> <dbl> <chr>
1 B 0 0 Low
2 D 0 0 Low
3 E 0 0 Low
4 F 0 0 Low
5 H 0 0 Low
6 K 0 0 Low
7 C 10 2000 Low
8 G 20 10 Low
9 A 300 100 Medium
10 J 700 50 High