ECC genuinely is the only check against memory bit flips in a typical system. Obviously, there’s other stuff that gets used in safety-critical or radiation-hardened systems, but those aren’t typical. Most software is written assuming that memory errors never happen, and checksumming is only used when there’s a network transfer or, less commonly, when data sits at rest on a hard drive or SSD for a long time (but most people are still running a filesystem with no redundancy beyond journaling, which is really meant for things like unexpected power loss).
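To make the checksumming point concrete, here’s a rough sketch of the kind of check that happens around a network transfer or long-term storage. It uses Python’s standard `zlib.crc32`; the `flip_bit` helper and the payload are made up for illustration and aren’t from any real system.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (hypothetical helper)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# Illustrative payload; the checksum is taken while the data is still good.
payload = b"unsaved document contents"
stored_crc = zlib.crc32(payload)

# Simulate a single bit flip somewhere in the buffer.
corrupted = flip_bit(payload, 13)

# The mismatch shows something changed, but not which bit, so this is
# detection only. It also only helps if the checksum was computed before
# the flip happened.
intact_ok = zlib.crc32(payload) == stored_crc
corrupted_ok = zlib.crc32(corrupted) == stored_crc
```

Note what this doesn’t cover: data that gets flipped in RAM before the checksum is computed is checksummed in its already-corrupted state, which is why this kind of check isn’t a substitute for ECC.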
There are things that mitigate the impact of memory errors on devices that can’t detect and correct them, but they’re not redundancies. They don’t keep everything working when a failure happens; they just isolate the problem to a single process so you don’t lose unsaved work in other applications, and so on. The main things they’re designed to protect against are software bugs and malicious actors, not memory errors. It just happens to be the case that they work on other things, too.
Also, it looks like some of the confusion is because of a typo in my original comment where I said unrecoverable instead of recoverable. The figures that are around 10% per year are in the CE column, which is the correctable errors, i.e. a single bit that ECC puts right. The figures for unrecoverable/uncorrectable errors are in the UE column, and they’re around 1%. It’s therefore the 10% figure that’s relevant to consumer devices without ECC, because every error that ECC would have corrected simply goes uncorrected there. There’s no need to extrapolate how many single bit flips would need to happen to cause 10% of machines to experience double bit flips.