How much storage would be required to store a human genome?
I'm looking for the amount of storage in bytes (MB, GB, TB, etc.) required to store a single human genome. I read a few articles on Wikipedia about DNA, chromosomes, base pairs, genes, and have some rough guess, but before disclosing anything I'd like to see how others would approach this issue.
An alternative question would be how many atoms are there in human DNA, but that would be off topic for this site.
I understand that this will be an approximation, so I'm looking for the minimal value that would be able to store DNA of any human.
If you trust such things, here is what Wikipedia claims (from http://en.wikipedia.org/wiki/Human_genome#Information_content):
The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be coded by 2 bits. Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.
You do not store all the DNA in one stream, rather most the time it is store by chromosomes.
A large chromosome take about 300 MB and a small one about 50 MB.
Edit:
I think the first reason why it is not saved in 2 bits per base pair is that it would cause an hurdle to work with the data. Most of the people would not know how to convert it. And even when a program for conversion would be given, a lot of people in large companies or research institutes are not allowed to/need to ask or do not know how to install programs...
1GB storage costs nothing, even the download of 3 GB takes only 4 minutes with 100 Mbitsps and most companies have faster speeds.
Another point is that the data isn't as simple as you get told.
e.g. The method for sequencing invented by Craig_Venter was a great breakthrough but has its down sides. It could not separate long chains of the same base pair, so it is not always 100% clear if there are 8 A's or 9 A's. Things you have to take care of later on...
Another example is the DNA methylation because you can't store this Information in a 2-bit representation.
Basically, each base pair takes 2 bits (you can use 00, 01, 10, 11 for T, G, C, and A). Since there are about 2.9 billion base pairs in the human genome, (2 * 2.9 billion) bits ~= 691 megabytes.
I'm no expert, however, the Human Genome page on Wikipedia states the following:
Raw MB:
- Male (XY): 770MB
- Female (XX): 756MB
I'm not sure where their variance comes from, but I'm sure you can figure it out.