How to calculate tertiles in R?

I have a dataset that is not normally distributed and contains a lot of 0 values. I now want to calculate the tertiles for each column.

df <- tribble(
  ~shop,   ~products,  ~sales,  
    'A',      300,       100,                
    'B',      0,           0, 
    'C',      10,       2000, 
    'D',      0,           0, 
    'E',      0,           0, 
    'F',      0,           0, 
    'G',      20,         10, 
    'H',      0,           0, 
    'J',      700,        50, 
    'K',      0,           0, 
)

Thanks to @AlexB 's answer to that question, I've tried calculating the tertiles with the following code:

df %>%
  arrange(products) %>% 
  mutate(tertiles = ntile(products, 3)) %>%
  mutate(tertiles = if_else(tertiles == 1, 'Low', if_else(tertiles == 2, 'Medium', 'High')))

However, even though the values are 0, the output is turning to 'high'. How can I calculate it more accurately?


I think what you're looking for could be achieved by using cut instead of ntile. Use the breaks argument in cut to define the limits for the three labels, and the labels argument to specify the labels themselves.

df %>%
  arrange(products) %>% 
  mutate(tertile = cut(products, 
                       breaks = c(-1, 1, 100, Inf), 
                       labels = c("low", "medium", "high")))

#> # A tibble: 10 x 4
#>    shop  products sales tertile
#>    <chr>    <dbl> <dbl> <fct>  
#>  1 B            0     0 low    
#>  2 D            0     0 low    
#>  3 E            0     0 low    
#>  4 F            0     0 low    
#>  5 H            0     0 low    
#>  6 K            0     0 low    
#>  7 C           10  2000 medium 
#>  8 G           20    10 medium 
#>  9 A          300   100 high   
#> 10 J          700    50 high

Addendum

To apply the same method to each column, we can do:

f <- function(x) cut(x, c(-1, 1, 100, Inf), c("low", "medium", "high"))

df %>%
  arrange(products) %>% 
  mutate(across(c("products", "sales"), .fns = f, .names = "{.col}_tertile"))
#> # A tibble: 10 x 5
#>    shop  products sales products_tertile sales_tertile
#>    <chr>    <dbl> <dbl> <fct>            <fct>        
#>  1 B            0     0 low              low          
#>  2 D            0     0 low              low          
#>  3 E            0     0 low              low          
#>  4 F            0     0 low              low          
#>  5 H            0     0 low              low          
#>  6 K            0     0 low              low          
#>  7 C           10  2000 medium           high         
#>  8 G           20    10 medium           medium       
#>  9 A          300   100 high             medium       
#> 10 J          700    50 high             medium

Created on 2022-01-23 by the reprex package (v2.0.1)


Update:

Somehow it is obvious that I was the first with the correct solution. But Allan Cameron finished it perfectly. So this is okay as I learned many things from Allan Cameron:

To give my final solution:

df %>%
  mutate(across(c(products, sales), ~cut(., breaks = 3, labels = c("low", "medium", "high")), .names = "tertile_{.col}"))
  shop  products sales tertile_products tertile_sales
   <chr>    <dbl> <dbl> <fct>            <fct>        
 1 A          300   100 medium           low          
 2 B            0     0 low              low          
 3 C           10  2000 low              high         
 4 D            0     0 low              low          
 5 E            0     0 low              low          
 6 F            0     0 low              low          
 7 G           20    10 low              low          
 8 H            0     0 low              low          
 9 J          700    50 high             low          
10 K            0     0 low              low  

First answer: For column products:

df %>%
  arrange(products) %>% 
  mutate(tertiles = cut(products, breaks = 3, labels = c(1:3))) %>% 
  mutate(tertiles = case_when(tertiles==1 ~ "Low",
                              tertiles==2 ~ "Medium",
                              tertiles==3 ~ "High",
                              TRUE ~NA_character_))
   shop  products sales tertiles
   <chr>    <dbl> <dbl> <chr>   
 1 B            0     0 Low     
 2 D            0     0 Low     
 3 E            0     0 Low     
 4 F            0     0 Low     
 5 H            0     0 Low     
 6 K            0     0 Low     
 7 C           10  2000 Low     
 8 G           20    10 Low     
 9 A          300   100 Medium  
10 J          700    50 High