Cosmic Rays: what is the probability they will affect a program?
Once again I was in a design review, and encountered the claim that the probability of a particular scenario was "less than the risk of cosmic rays" affecting the program, and it occurred to me that I didn't have the faintest idea what that probability is.
"Since 2-128 is 1 out of 340282366920938463463374607431768211456, I think we're justified in taking our chances here, even if these computations are off by a factor of a few billion... We're way more at risk for cosmic rays to screw us up, I believe."
Is this programmer correct? What is the probability of a cosmic ray hitting a computer and affecting the execution of the program?
From Wikipedia:
Studies by IBM in the 1990s suggest that computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month.[15]
This means a probability of 3.7 × 10-9 per byte per month, or 1.4 × 10-15 per byte per second. If your program runs for 1 minute and occupies 20 MB of RAM, then the failure probability would be
60 × 20 × 1024²
1 - (1 - 1.4e-15) = 1.8e-6 a.k.a. "5 nines"
Error checking can help to reduce the aftermath of failure. Also, because of more compact size of chips as commented by Joe, the failure rate could be different from what it was 20 years ago.
Apparently, not insignificant. From this New Scientist article, a quote from an Intel patent application:
"Cosmic ray induced computer crashes have occurred and are expected to increase with frequency as devices (for example, transistors) decrease in size in chips. This problem is projected to become a major limiter of computer reliability in the next decade. "
You can read the full patent here.
Note: this answer is not about physics, but about silent memory errors with non-ECC memory modules. Some of errors may come from outer space, and some - from inner space of desktop.
There are several studies of ECC memory failures on large server farms like CERN clusters and Google datacenters. Server-class hardware with ECC can detect and correct all single bit errors, and detect many multi-bit errors.
We can assume that there is lot of non-ECC desktops (and non-ECC mobile smartphones). If we check the papers for ECC-correctable error rates (single bitflips), we can know silent memory corruptions rate on non-ECC memory.
Large scale CERN 2007 study "Data integrity": vendors declares "Bit Error Rate of 10-12 for their memory modules", "a observed error rate is 4 orders of magnitude lower than expected". For data-intensive tasks (8 GB/s of memory reading) this means that single bit flip may occur every minute (10-12 vendors BER) or once in two days (10-16 BER).
2009 Google's paper "DRAM Errors in the Wild: A Large-Scale Field Study" says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1 - 5 bit errors per hour for 8GB of RAM after my calculations. Paper says the same: "mean correctable error rates of 2000–6000 per GB per year".
2012 Sandia report "Detection and Correction of Silent Data Corruptionfor Large-Scale High-Performance Computing": "double bit flips were deemed unlikely" but at ORNL's dense Cray XT5 they are "at a rate of one per day for 75,000+ DIMMs" even with ECC. And single-bit errors should be higher.
So, if the program has large dataset (several GB), or has high memory reading or writing rate (GB/s or more), and it runs for several hours, then we can expect up to several silent bit flips on desktop hardware. This rate is not detectable by memtest, and DRAM modules are good.
Long cluster runs on thousands of non-ECC PCs, like BOINC internet-wide grid computing will always have errors from memory bit-flips and also from disk and network silent errors.
And for bigger machines (10 thousands of servers) even with ECC protection from single-bit errors, as we see in Sandia's 2012 report, there can be double-bit flips every day, so you will have no chance to run full-size parallel program for several days (without regular checkpointing and restarting from last good checkpoint in case of double error). The huge machines will also get bit-flips in their caches and cpu registers (both architectural and internal chip's triggers e.g. in ALU datapath), because not all of them are protected by ECC.
PS: Things will be much worse if the DRAM module is bad. For example, I installed new DRAM into laptop, which died several weeks later. It started to give lot of memory errors. What I get: laptop hangs, linux reboots, runs fsck, finds errors on root filesystem and says that it want to do reboot after correcting errors. But at every next reboot (I did around 5-6 of them) there are still errors found on the root filesystem.
Wikipedia cites a study by IBM in the 90s suggesting that "computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month." Unfortunately the citation was to an article in Scientific American, which didn't give any further references. Personally, I find that number to be very high, but perhaps most memory errors induced by cosmic rays don't cause any actual or noticable problems.
On the other hand, people talking about probabilities when it comes to software scenarios typically have no clue what they are talking about.
Well, cosmic rays apparently caused the electronics in Toyota cars to malfunction, so I would say that the probability is very high :)
Are cosmic rays really causing Toyota woes?