Why do SSDs tend to fail much more suddenly than HDDs?

You have most of the points correct, but here are a few that may explain your question:

SSDs are only generally faster than HDDs. The largest-capacity hard drives actually rival low- and mid-range SSDs in sequential speed, because hard drive bandwidth tends to be proportional to data density. And on SSDs, write speed can be as low as half the read speed.

SSDs and HDDs have very different wear issues and failure modes. In general, HDD wear comes from start/stop cycles and total hours spent spinning, and it is only statistically predictable -- some drives easily last 5-10x their warranty period. The spinning itself doesn't kill the HDD; rather, some random wear event influenced by the spinning may or may not occur. Most current SSDs have a limited number of write cycles per block, and when they reach that count, they will fail very soon after. Enterprise-grade SSDs are rated in maximum drive writes per day (DWPD) sustained over the warranty period.
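
To put that endurance rating in perspective, here is a back-of-the-envelope calculation. The capacity, DWPD figure and warranty length below are made-up illustrative numbers, not the rating of any particular drive:

    # Back-of-the-envelope endurance arithmetic with hypothetical numbers.
    capacity_gb = 960        # advertised capacity in GB (assumed)
    dwpd = 1.0               # rated drive-writes-per-day (assumed)
    warranty_years = 5       # warranty period (assumed)

    # Total bytes written (TBW) that the rating implies over the warranty period.
    tbw_tb = capacity_gb * dwpd * 365 * warranty_years / 1000
    full_overwrites = dwpd * 365 * warranty_years

    print(f"~{tbw_tb:.0f} TB of rated writes over {warranty_years} years")
    print(f"= ~{full_overwrites:.0f} complete overwrites of the drive")

Whether a given workload comes anywhere near that budget is a separate question; most desktop workloads don't.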

Both SSDs and HDDs include an internal failure prediction system. On Linux, the smartmontools package with its smartctl utility can read the SMART data from most drives of both types; NVMe SSDs may instead need the nvme-cli package, which extracts similar data. Similar tools are available in Windows, and some disk vendors also ship their own. Using these tools, you can usually detect impending failure long before the drive fails. Some SSDs will actually report their exact health (as a percentage of rated write cycles used so far); in those cases, even a brand-new SSD can tell you roughly how far away failure is. A HDD can't tell you when it will fail; it can only tell you whether a random failure event has already occurred and the drive is already failing.
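
If you want to script a health check rather than eyeball the output, something along these lines works on Linux. This is a minimal sketch: it assumes smartmontools and nvme-cli are installed, that the device paths match your system, and it only scrapes the human-readable output rather than doing proper parsing:

    import subprocess

    # SATA/SAS drive: overall SMART health verdict from smartmontools.
    # (Needs root; /dev/sda is just an example path.)
    sata = subprocess.run(["smartctl", "-H", "/dev/sda"],
                          capture_output=True, text=True)
    print(sata.stdout)

    # NVMe SSD: the SMART/health log includes a "percentage_used" field,
    # i.e. how much of the rated write endurance has been consumed so far.
    nvme = subprocess.run(["nvme", "smart-log", "/dev/nvme0"],
                          capture_output=True, text=True)
    for line in nvme.stdout.splitlines():
        if "percentage_used" in line or "media_errors" in line:
            print(line.strip())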

In my experience, hard drives sometimes give a week or two of warning before catastrophic failure, in the form of an increasing number of uncorrectable I/O errors. Typically, a flake from a wear spot floats around inside and scratches the rest of the platters, so the drive then fails fairly suddenly, with very little warning before data loss has already started. Spindle failure in disks is rare, and probably indicates a manufacturing defect.

Most SSDs use something called wear leveling, which reuses blocks that haven't been written in a while by copying their data to another block and then writing the new data in their place, so that all blocks accumulate roughly the same number of writes. This extends the life of the drive, but it also means you get no errors until the drive does fail, and then it fails evenly and all at once; it doesn't get "wear spots". On the other hand, the drive knows how close it is to that limit from the very beginning, and if you ask it, it will tell you.
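
To make the "fails evenly and all at once" behaviour concrete, here is a toy model of a least-worn-block allocator. It is nothing like real FTL firmware, and the block count and endurance figure are arbitrary:

    import random

    BLOCKS = 100          # toy drive with 100 blocks
    ENDURANCE = 1000      # assumed program/erase cycles each block tolerates
    erase_count = [0] * BLOCKS

    def write_somewhere():
        # Wear leveling: always program the least-worn block (the data
        # copying/relocation a real drive does is omitted here).
        victim = min(range(BLOCKS), key=lambda b: erase_count[b])
        erase_count[victim] += 1

    # Hammer the toy drive with bursts of writes until a block wears out.
    while max(erase_count) < ENDURANCE:
        for _ in range(random.randint(1, 50)):
            write_somewhere()

    print("least-worn block:", min(erase_count), "/ most-worn block:", max(erase_count))
    print("life used:", 100 * max(erase_count) // ENDURANCE, "%")

Every block reaches the limit within a write or two of every other block, which is exactly why there is no gradual trickle of bad sectors beforehand -- and also why the drive can report its remaining life as a simple percentage.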

As already pointed out, if you drop a HDD, that may kill it instantly. So with all this in mind, saying that an SSD fails instantly while a HDD warns you is a bit backwards. Of course, either one can fail suddenly from a controller failure.

SSDs do not work with individual bits. They work in blocks, typically 2048 bytes or more, and when you write to one, the controller has to find an unused block or relocate an under-used block and then overwrite it. SSDs are actually less fine-grained than disks, so you've got that backwards too. But block size is really an implementation detail, and it has gotten bigger over the years for disks too.
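
As a sketch of what "working in blocks" means in practice (the block size and the copy-on-write behaviour here are simplifications, not any particular controller's design):

    BLOCK = 2048      # bytes per flash block in this sketch
    flash = {}        # physical block number -> block contents
    mapping = {}      # logical block number  -> physical block number
    next_free = 0

    def write_bytes(logical_block, offset, data):
        """Changing even a couple of bytes means rewriting a whole block elsewhere."""
        global next_free
        old_phys = mapping.get(logical_block)
        block = bytearray(flash.get(old_phys, b"\xff" * BLOCK))  # read old contents
        block[offset:offset + len(data)] = data                  # modify in RAM
        flash[next_free] = bytes(block)                          # program a fresh block
        mapping[logical_block] = next_free                       # update the translation table
        next_free += 1                                           # old block now awaits erasure

    write_bytes(5, 100, b"hi")   # a 2-byte update still costs a full 2048-byte program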


The typical reason for SSD "sudden death" comes down to the metadata the drive needs in order to function at all. @LawrenceC touched on this briefly in their answer, but I'll expand on it here.

All SSDs require certain metadata to function:

  • Controller firmware to establish baseline functionality
  • Microcode updates to alter functionality or encryption keys
  • Wear-leveling data used by the Flash Translation Layer
  • S.M.A.R.T. attributes, etc.

All of the above are stored in the so-called system area of the NAND flash. Roman Morozov, a data recovery specialist at ACELab, provides helpful details in an excellent blog post at Elcomsoft (emphasis mine):

The system area contains SSD firmware (the microcode to boot the controller) and system structures. The size of the system area is in the range of 4 to 12 GB. In this area, the SSD controller stores system structures called “modules”. Modules contain essential data such as translation tables, parts of microcode that deal with the media encryption key, SMART attributes and so on.

The translation table—ironically, the very mechanism which enables wear-leveling—cannot be wear-leveled:

If you have read our previous article, you are aware of the fact that SSD drives actively remap addresses of logical blocks, pointing the same logical address to various physical NAND cells in order to level wear and boost write speeds. Unfortunately, in most (all?) SSD drives the physical location of the system area must remain constant. It cannot be remapped; wear leveling is not applicable to at least some modules in the system area. This in turn means that a constant flow of individual write operations, each modifying the content of the translation table, will write into the same physical NAND cells over and over again...

Since the NAND System Area cannot be wear-leveled, it experiences much higher stress than the data area. This is exacerbated by frequent small writes:

Such usage scenarios will cause premature wear on the system area without any meaningful indication in any SMART parameters. As a result, a perfectly healthy SSD with 98-99% of remaining lifespan can suddenly disappear from the system. At this point, the SSD controller cannot perform successful ECC corrections of essential information stored in the system area. The SSD disappears from the computer’s BIOS or appears as empty/uninitialized/unformatted media.

This can manifest itself as a completely dead drive if the controller can't boot its firmware:

If the SSD drive does not appear in the computer’s BIOS, it may mean its controller is in a bootloop. Internally, the following cyclic process occurs. The controller attempts to load microcode from NAND chips into the controller’s RAM; an error occurs; the controller retries; an error occurs; etc.

But more often than not, the data will simply disappear because the FTL has failed:

However, the most frequent point of failure are errors in the translation module that maps physical blocks to logical addresses. If this error occurs, the SSD will be recognized as a device in the computer’s BIOS. However, the user will be unable to access information; the SSD will appear as uninitialized (raw) media, or will advertise a significantly smaller storage capacity (e.g. 2MB instead of the real capacity of 960GB).

So the sudden failure of SSDs has always been a result of limited write endurance—but of the drive metadata, not the user data.
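
A crude way to see why the metadata gives out first: assume, purely for illustration, that every small user write touches one wear-leveled data block but also updates the translation table, which lives at a fixed physical location. All the numbers here are invented:

    DATA_BLOCKS = 100_000    # wear-leveled user-data blocks (invented)
    ENDURANCE = 3_000        # assumed program/erase cycles per block
    user_writes = 10_000_000 # small random writes over the drive's life (invented)

    # Wear on the data area is spread across every block...
    data_area_wear = user_writes / DATA_BLOCKS   # ~100 cycles per block
    # ...but each of those writes also updates the translation table,
    # which sits in a fixed, non-leveled system-area block.
    system_area_wear = user_writes

    print(f"data area:   {data_area_wear:,.0f} of {ENDURANCE:,} cycles used per block")
    print(f"system area: {system_area_wear:,.0f} of {ENDURANCE:,} cycles used")

Real controllers cache and batch translation-table updates, so the real ratio is nowhere near this extreme, but the asymmetry is the point: the SMART "life remaining" figure tracks the nicely leveled data area, while the part that actually dies first is the fixed system area.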


Since the advent of IDE drives, and likely even before, each hard drive has basically been an embedded computer platform--it has a CPU and RAM, and it runs firmware that talks to the PC over a bus/port and talks to the drive electronics over internal I/O interfaces. The firmware likely stores its own data either on the storage medium itself or in additional flash on the circuit board.

Unless you are somehow "inside" this embedded computer, there's really no way to tell the exact reason a drive has failed, short of obvious external signs like noises or platter-scratching sounds--and SSDs don't give you even those. SMART gives you some abstracted information and clues, but nothing like a "crash dump" to analyze at the moment of failure.

Now, there are likely hidden/undocumented service interfaces on these devices, used during manufacturing and servicing, but without a lot of reverse-engineering work no one is going to know about them or how to use them, and they likely vary greatly by manufacturer, model, series, etc.

Yet, despite all this, HDDs usually fail gradually, while SSDs usually fail all at once.

Given the above, honestly we can really only speculate. Some information for speculation:

  • NAND works vastly differently from platters. NAND needs to be erased before it can be rewritten, erase blocks are larger than data blocks, and writes wear out the flash (see the sketch after this list). None of this applies to platters.

  • SSDs attached to the same interfaces that HDDs use could not be honest about how they worked; they needed to look like platter HDDs.

  • So both of the above mean that SSD firmware from the outset has to be more complex. More complex = more opportunities for failure.

  • If device firmware runs into a bad condition while booting and can't boot, or simply refuses to boot, the result is a dead device. Overall, the chance of this increases with the additional complexity implied by the extra tasks we know SSD firmware is doing.

  • The possible exact reasons for this are myriad, and you have no further visibility into a specific failure unless you have access to device-level diagnostic/service interfaces.

  • Ideally, an SSD that can no longer write would simply become read-only. Unfortunately, there are no standards here and nothing stopping an SSD manufacturer from shipping firmware that simply gives up and refuses to boot once it is out of spare flash to support additional writes.

  • There are "legitimate" situations where the firmware really doesn't have much of a choice except to "die" - e.g. an entire flash chip dies or stops responding (most SSDs have several), the SSD's internal RAM goes bad, or the fixed locations on internal flash that the SSD uses to boot or do basic tasks die. Do these things happen more often than the equivalents with platter drives? Who knows; I wouldn't even know where to go to get that data.
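
To illustrate the first bullet above (NAND must be erased before it can be rewritten, and an erase wipes a whole block of pages at once), here is a toy model; the page count is arbitrary and the bookkeeping is vastly simplified compared to real firmware:

    PAGES_PER_ERASE_BLOCK = 64      # arbitrary; real NAND is often 64-256 pages per block

    # State of each page within one erase block: "free", "valid" or "stale".
    pages = ["free"] * PAGES_PER_ERASE_BLOCK

    def program(page_no):
        # A NAND page can only be programmed once per erase cycle.
        if pages[page_no] != "free":
            raise RuntimeError("must erase the whole block before reprogramming")
        pages[page_no] = "valid"

    def overwrite(page_no):
        # In-place overwrites are impossible: mark the old page stale, write the
        # new data to a free page somewhere else, and erase this block later
        # (after copying out any pages that are still valid).
        pages[page_no] = "stale"

    program(0)
    overwrite(0)
    # program(0) would now raise -- the platter-style "just write over it" model
    # does not exist, which is part of why SSD firmware has to be more complex.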

I've heard of some SSDs that have UARTs on what used to be the jumper pins on classic HDDs. Perhaps when the devices don't boot, something is output there.

But overall, the main point is that with most devices we simply have no way to tell why an SSD suddenly stopped booting, and given that SSD firmware has to be more complex and time-to-market is everything with new technology, we shouldn't be surprised that total failure happens more often.