Create a table from dataframe column values mean and standard deviation in R

I am new to R and don't know how to write the code for this. I have two data frames. The first one looks like this:

df

ID	Disease
GSM239170	Control
GSM239323	Control
GSM239324	Control
GSM239326	Control
GSM239328	AML
GSM239329	AML
GSM239331	AML
GSM239332	Control
GSM239333	Control

And the other dataframe looks like this:

df1

GSM239170	GSM239323	GSM239324	GSM239326	GSM239328	GSM239329	GSM239331	GSM239332	GSM239333
3.016704177	3.285669072	2.929482692	2.922820483	3.15950317	3.163327169	2.985901308	3.122708843	3.070948463
7.977735461	6.532514237	6.388007183	6.466679556	6.432795021	6.407321524	6.426470803	6.376394357	6.469070308
4.207280707	4.994965767	4.40159671	4.747114589	4.830045513	4.213762092	4.884418365	4.4318876	4.849665444
7.25609471	7.420807337	6.999340125	7.094488581	7.024332721	7.17928981	7.159898654	7.009977785	6.830979234
2.204955099	2.331625217	2.133305231	2.18332885	2.12778313	2.269697813	2.264705552	2.253940441	2.287924323
7.28437278	6.983593721	6.86337111	6.865970678	7.219840938	7.181113053	7.392230178	7.484052914	7.52498281
4.265792764	4.970684112	4.595545125	4.575545289	4.547957809	4.68215122	4.674495889	4.675841709	4.643311767
2.6943516	2.916324936	2.578130269	2.659717988	2.567436676	2.8095128	2.790110381	2.795882913	2.884588792
3.646303109	8.817891552	11.4248793	10.74738082	9.296043108	9.53150669	8.285160496	9.769919327	9.774610531
3.040292001	3.38486713	2.958851115	3.047880699	2.878562717	3.209319974	3.20260379	3.195993624	3.3004227
2.357625231	2.444753172	2.340767158	2.32143889	2.282608342	2.401218719	2.385568421	2.375334953	2.432634747
5.378494673	6.065038394	5.134842087	5.367342376	5.682051149	5.712072512	5.57179966	5.72082395	5.656674512
2.833814735	3.038434511	2.837711812	2.859800224	2.866040813	2.969167906	2.929449968	2.963530689	2.931065261
6.192932281	6.478439634	6.180169144	6.151689376	6.238949956	6.708196123	6.441437631	6.448280595	6.413562269
4.543042482	4.786227217	4.445131477	4.51471011	4.491645167	4.460114204	4.602482637	4.587221948	4.623125028
6.069437462	6.232738284	6.74644117	7.04995802	6.938928532	6.348253102	6.080950712	6.324619355	6.472893789

I want to build a table with mean_AML, sd_AML (standard deviation), min_AML, max_AML, mean_Control, sd_Control, min_Control, max_Control, and Fold_change (i.e., mean_AML - mean_Control) for each gene. It is fine to use built-in functions.

I can't figure out how to do this. Please help.

Thanks.

--- UPDATE ---

Hints: split the dataset into AML and normal (Control) data sets; then, for each gene/probeset, calculate the mean, standard deviation, min and max expression values across samples separately (using built-in functions), and merge these statistics for each gene into one table. Use data.frame() and give the created table the same row names as the gene expression data table.

Another option uses the older function tidyr::gather to put the sample IDs in a column that can be joined with df:

library(tidyverse)

df2_spread <- df1 %>% 
  tidyr::gather(ID, val) %>% 
  left_join(df, by = 'ID')
  
result_1 <- df2_spread %>% 
  group_by(Disease, gene = ID) %>% 
  summarise(n = n(),
            mean = mean(val),
            sd = sd(val),
            min = min(val),
            max = max(val), .groups = "drop")

# A tibble: 9 × 7
  Disease gene          n  mean    sd   min   max
  <chr>   <chr>     <int> <dbl> <dbl> <dbl> <dbl>
1 AML     GSM239328    16  4.91  2.15  2.13  9.30
2 AML     GSM239329    16  4.95  2.13  2.27  9.53
3 AML     GSM239331    16  4.88  1.96  2.26  8.29
4 Control GSM239170    16  4.56  1.91  2.20  7.98
5 Control GSM239323    16  5.04  1.98  2.33  8.82
6 Control GSM239324    16  4.93  2.45  2.13 11.4 
7 Control GSM239326    16  4.97  2.34  2.18 10.7 
8 Control GSM239332    16  4.97  2.16  2.25  9.77
9 Control GSM239333    16  5.01  2.14  2.29  9.77

However, this groups by sample ID (the column I named gene), so each group contains only one disease, and I was not able to calculate Fold_change for each gene from it.

Here is the data:
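
The hints in the question can be followed more directly in base R, on the reading that each row of df1 is one gene and each column one sample. The following is only a minimal sketch under that assumption: it uses just the first three rows of df1 to stay short, and the names aml, ctrl and result are made up for illustration. Since df1 has no gene names, the default row names are reused.

```r
# Sample annotation, copied from the question.
df <- data.frame(
  ID = c("GSM239170", "GSM239323", "GSM239324", "GSM239326", "GSM239328",
         "GSM239329", "GSM239331", "GSM239332", "GSM239333"),
  Disease = c("Control", "Control", "Control", "Control", "AML",
              "AML", "AML", "Control", "Control")
)

# First three genes (rows) of df1 only, to keep the example short.
df1 <- data.frame(
  GSM239170 = c(3.016704177, 7.977735461, 4.207280707),
  GSM239323 = c(3.285669072, 6.532514237, 4.994965767),
  GSM239324 = c(2.929482692, 6.388007183, 4.40159671),
  GSM239326 = c(2.922820483, 6.466679556, 4.747114589),
  GSM239328 = c(3.15950317, 6.432795021, 4.830045513),
  GSM239329 = c(3.163327169, 6.407321524, 4.213762092),
  GSM239331 = c(2.985901308, 6.426470803, 4.884418365),
  GSM239332 = c(3.122708843, 6.376394357, 4.4318876),
  GSM239333 = c(3.070948463, 6.469070308, 4.849665444)
)

# Split the expression columns by disease group (hypothetical names).
aml  <- as.matrix(df1[, df$ID[df$Disease == "AML"], drop = FALSE])
ctrl <- as.matrix(df1[, df$ID[df$Disease == "Control"], drop = FALSE])

# Row-wise statistics: one row of the result per gene.
result <- data.frame(
  mean_AML     = rowMeans(aml),
  sd_AML       = apply(aml, 1, sd),
  min_AML      = apply(aml, 1, min),
  max_AML      = apply(aml, 1, max),
  mean_Control = rowMeans(ctrl),
  sd_Control   = apply(ctrl, 1, sd),
  min_Control  = apply(ctrl, 1, min),
  max_Control  = apply(ctrl, 1, max),
  row.names    = rownames(df1)
)

# Fold change as defined in the question: difference of the group means.
result$Fold_change <- result$mean_AML - result$mean_Control
result
```

The same code should work unchanged on the full 16-row df1, because the columns are selected by the IDs in df rather than by position.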


df <- tibble::tribble(
          ~ID,  ~Disease,
  "GSM239170", "Control",
  "GSM239323", "Control",
  "GSM239324", "Control",
  "GSM239326", "Control",
  "GSM239328",     "AML",
  "GSM239329",     "AML",
  "GSM239331",     "AML",
  "GSM239332", "Control",
  "GSM239333", "Control"
  )



df1 <- tibble::tribble(
   ~GSM239170,  ~GSM239323,  ~GSM239324,  ~GSM239326,  ~GSM239328,  ~GSM239329,  ~GSM239331,  ~GSM239332,  ~GSM239333,
  3.016704177, 3.285669072, 2.929482692, 2.922820483,  3.15950317, 3.163327169, 2.985901308, 3.122708843, 3.070948463,
  7.977735461, 6.532514237, 6.388007183, 6.466679556, 6.432795021, 6.407321524, 6.426470803, 6.376394357, 6.469070308,
  4.207280707, 4.994965767,  4.40159671, 4.747114589, 4.830045513, 4.213762092, 4.884418365,   4.4318876, 4.849665444,
   7.25609471, 7.420807337, 6.999340125, 7.094488581, 7.024332721,  7.17928981, 7.159898654, 7.009977785, 6.830979234,
  2.204955099, 2.331625217, 2.133305231,  2.18332885,  2.12778313, 2.269697813, 2.264705552, 2.253940441, 2.287924323,
   7.28437278, 6.983593721,  6.86337111, 6.865970678, 7.219840938, 7.181113053, 7.392230178, 7.484052914,  7.52498281,
  4.265792764, 4.970684112, 4.595545125, 4.575545289, 4.547957809,  4.68215122, 4.674495889, 4.675841709, 4.643311767,
    2.6943516, 2.916324936, 2.578130269, 2.659717988, 2.567436676,   2.8095128, 2.790110381, 2.795882913, 2.884588792,
  3.646303109, 8.817891552,  11.4248793, 10.74738082, 9.296043108,  9.53150669, 8.285160496, 9.769919327, 9.774610531,
  3.040292001,  3.38486713, 2.958851115, 3.047880699, 2.878562717, 3.209319974,  3.20260379, 3.195993624,   3.3004227,
  2.357625231, 2.444753172, 2.340767158,  2.32143889, 2.282608342, 2.401218719, 2.385568421, 2.375334953, 2.432634747,
  5.378494673, 6.065038394, 5.134842087, 5.367342376, 5.682051149, 5.712072512,  5.57179966,  5.72082395, 5.656674512,
  2.833814735, 3.038434511, 2.837711812, 2.859800224, 2.866040813, 2.969167906, 2.929449968, 2.963530689, 2.931065261,
  6.192932281, 6.478439634, 6.180169144, 6.151689376, 6.238949956, 6.708196123, 6.441437631, 6.448280595, 6.413562269,
  4.543042482, 4.786227217, 4.445131477,  4.51471011, 4.491645167, 4.460114204, 4.602482637, 4.587221948, 4.623125028,
  6.069437462, 6.232738284,  6.74644117,  7.04995802, 6.938928532, 6.348253102, 6.080950712, 6.324619355, 6.472893789
  )

We could combine pivot_longer with right_join and then use summarise on each Disease group:

library(dplyr)
library(tidyr)
df1 %>% 
  pivot_longer(
    everything(),
    names_to = "ID", 
    values_to = "value"
  ) %>% 
  right_join(df, by="ID") %>% 
  group_by(Disease) %>% 
  summarise(Min = min(value), Mean = mean(value), Max = max(value), Sd = sd(value)) %>%
  ungroup()

  Disease   Min  Mean   Max    Sd
  <chr>   <dbl> <dbl> <dbl> <dbl>
1 AML      2.13  4.91  9.53  2.04
2 Control  2.13  4.92 11.4   2.12
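
The summary above pools all genes within each disease. To get one row per gene with a Fold_change column, as the question asks, the same pipeline could carry a gene identifier through the reshape. This is a sketch only, assuming each row of df1 is one gene; the gene column (the row index) and the name per_gene are hypothetical, and only the first three rows of df1 are used to keep it short.

```r
library(dplyr)
library(tidyr)

# Sample annotation, copied from the question.
df <- tibble::tibble(
  ID = c("GSM239170", "GSM239323", "GSM239324", "GSM239326", "GSM239328",
         "GSM239329", "GSM239331", "GSM239332", "GSM239333"),
  Disease = c("Control", "Control", "Control", "Control", "AML",
              "AML", "AML", "Control", "Control")
)

# First three genes (rows) of df1 only, to keep the example short.
df1 <- tibble::tibble(
  GSM239170 = c(3.016704177, 7.977735461, 4.207280707),
  GSM239323 = c(3.285669072, 6.532514237, 4.994965767),
  GSM239324 = c(2.929482692, 6.388007183, 4.40159671),
  GSM239326 = c(2.922820483, 6.466679556, 4.747114589),
  GSM239328 = c(3.15950317, 6.432795021, 4.830045513),
  GSM239329 = c(3.163327169, 6.407321524, 4.213762092),
  GSM239331 = c(2.985901308, 6.426470803, 4.884418365),
  GSM239332 = c(3.122708843, 6.376394357, 4.4318876),
  GSM239333 = c(3.070948463, 6.469070308, 4.849665444)
)

per_gene <- df1 %>%
  # Hypothetical gene id: the row index, since df1 has no gene names.
  mutate(gene = row_number()) %>%
  pivot_longer(-gene, names_to = "ID", values_to = "value") %>%
  left_join(df, by = "ID") %>%
  # Summarise within each gene and disease group.
  group_by(gene, Disease) %>%
  summarise(mean = mean(value), sd = sd(value),
            min = min(value), max = max(value), .groups = "drop") %>%
  # Spread to one row per gene: mean_AML, mean_Control, sd_AML, ...
  pivot_wider(names_from = Disease, values_from = c(mean, sd, min, max)) %>%
  mutate(Fold_change = mean_AML - mean_Control)

per_gene
```

Grouping by both gene and Disease before pivot_wider is what puts the AML and Control statistics side by side, so the fold change can be computed as a plain column difference.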
