Do bad clusters on modern (as of 2020) SSD indicate hardware failure?

I've got a Samsung 970 Evo 2TB SSD that has been working perfectly for a year and a half, under a mostly-read load.

Recently, while updating Windows to the 2004 release, the OS got stuck in the rollback loop. Trying to fix it with the command prompt, I ran into chkdsk finding a lot of bad clusters, with the infamous "unspecified 75736..." error in WinRE where chkdsk fails to fix the clusters.

Scanning the drive from a Windows install on a separate drive (that I'm now running) revealed and fixed a large number of bad clusters both on the Windows partition (explained by the bad update) and on a second partition that only ever stored third-party software (not explained by the update). The SMART is perfect. Some repeated scans failed to fix bad blocks again with the same error.

On an HDD, I'd expect that to indicate the disk going bad. But my understanding is that modern SSDs manage their good and bad blocks internally (to a much greater extent than modern HDDs do), as NAND memory is slowly failing all the time over the course of normal operation, and SSDs wouldn't work as normal drives without this layer. It seems strange for the OS to find errors the controller can't.

It's one of the fastest PCI-E SSDs, and at $500 the simple HDD maxim "if in doubt, toss it" doesn't apply. I'd prefer to try and keep the drive usable, if it is. There is a manufacturer warranty, but proving that the drive is defective, if it is, is difficult with a perfect SMART. The data is not a concern; what's intact has been copied. I'm also interested in general knowledge on the subject.

At this point in time, are bad clusters on SSD still a sign of hardware problems, or do modern controllers hide such failures from the end user?

Dead blocks that don't show in SMART are usually due to a faulty or poor-quality SATA cable (not that unusual) or the SATA controller on the motherboard (very rare, and usually accompanied by other stability issues as well).

You could try replacing the SATA cable to see if that helps.

But normally, by the time the OS sees "bad blocks", the disk is past dead.

It doesn't matter whether it's an HDD or an SSD.

The drive firmware is supposed to re-map bad blocks to its internal stock of "spare blocks". If the OS sees bad blocks, either that stock is already exhausted or there is an internal problem with the drive firmware that makes it unable to do the re-map.
Anyway, the disk can't be considered reliable anymore.
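
To see whether the spare stock really is running out, the drive's own health log can be read. Below is a minimal sketch, assuming smartmontools 7.0 or later is installed (for `smartctl -j` JSON output) and that the drive is NVMe. The key names are the ones I'd expect in smartctl's NVMe JSON output and are worth checking against your own dump; the device path in the usage note is hypothetical.

```python
import json
import subprocess


def read_smart(device: str) -> dict:
    """Run smartctl and return its JSON report as a dict.

    Assumes smartctl is on PATH. Its exit status is a bit mask that can be
    nonzero even when a report was produced, so it is not checked here.
    """
    result = subprocess.run(
        ["smartctl", "-a", "-j", device], capture_output=True, text=True
    )
    return json.loads(result.stdout)


def spare_status(smart: dict) -> dict:
    """Pick the spare-block and error counters out of an NVMe health log."""
    # Assumed key names; a missing key simply yields None.
    log = smart.get("nvme_smart_health_information_log", {})
    return {
        "available_spare_pct": log.get("available_spare"),
        "spare_threshold_pct": log.get("available_spare_threshold"),
        "media_errors": log.get("media_errors"),
        "critical_warning": log.get("critical_warning"),
    }


def spare_exhausted(smart: dict) -> bool:
    """True if the reported spare percentage has fallen below the threshold."""
    status = spare_status(smart)
    spare = status["available_spare_pct"]
    threshold = status["spare_threshold_pct"]
    if spare is None or threshold is None:
        return False  # not enough information to say
    return spare < threshold
```

Usage would look like `spare_status(read_smart("/dev/nvme0"))`. If the spare percentage is still high and the media error count is zero while the OS keeps reporting bad clusters, that points away from exhausted spares and toward the firmware or the path between the drive and the OS.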

As the SSD is just 18 months old, it may still be under factory warranty, so contact Samsung technical support. They may be able to make a better diagnosis.

Either way, I would copy my important data to another disk and replace this SSD as soon as possible. To me, my data is more valuable than the price of another SSD.

And please do a fresh Windows install, regardless of whether you keep the drive or replace it.
There is no telling which files are damaged or how reliable Windows and other software are now. There is a good chance of all sorts of hidden problems that don't show up straight away but will keep haunting you for months to come if you don't re-install.

Do bad clusters on modern (as of 2020) SSD indicate hardware failure?

At this point in time, are bad clusters on SSD still a sign of hardware problems, or do modern controllers hide such failures from the end user?

As of 2021, there are plenty of occurrences of both. Failures can and will slip through the cracks; how often depends on the quality of the drive, both because mid-to-lower-tier laptops always use older hardware and because some new devices are made on the cheap. For instance, a "Plus" drive and an "Ultra 3D" drive from the same manufacturer may differ by only 10% in price but hugely in the rate of undetected hardware faults.

Also, even a manufacturer that makes good products has a bad batch every so often (e.g. certain flagship cell phones that would catch fire when used with third-party or airplane chargers).

It seems strange for the OS to find errors the controller can't.

I find that it happens much more with really large files. I'm guessing the controllers are limited in their memory or in their ability to recognize what is a file after a certain point. What's worse is that if you have more than one bad cluster (usually noncontiguous) in a single multi-GB file, you have to run chkdsk multiple times to pick all of them up. This seems to be a shortcoming of chkdsk.

Some repeated scans failed to fix bad blocks again with the same error.

I've run into three situations:

1. You can test whether you're "making progress" by trying to copy a file after you've scanned it. Usually, every time you run chkdsk you'll be able to copy a little more of the file before it errors out again. The last time this happened to me, it was a 37.8GB file that took 7 scans before I could copy it to safety. The first time I ran it: Adding 2964 bad clusters to the Bad Clusters File. On my final run: Adding 19 bad clusters to the Bad Clusters File. The number should go down each time.
2. If the number isn't going down, the drive is probably dying; get whatever data you can off it. Thank goodness you can read whatever is still readable without professional equipment.
3. Sometimes there's buggy firmware. Since you are already reading the SMART data, there's a good chance the same software has a place to update the drive's firmware. Check that out: if there is a firmware update, then once you've got all the files off that you can, update the firmware and do a clean wipe (slow format) of the drive. With the updated firmware it may actually last a lot longer.
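
The "is the number going down" check can be automated if you save the text of each chkdsk run. This is a rough sketch that only assumes the summary line has the wording quoted above; the function names are my own, and the intermediate runs are not shown because I only quoted the first and last.

```python
import re

# Matches the summary line as quoted above, e.g.
# "Adding 2964 bad clusters to the Bad Clusters File."
BAD_CLUSTER_LINE = re.compile(r"Adding (\d+) bad clusters to the Bad Clusters File")


def bad_clusters_added(chkdsk_output: str) -> int:
    """Return how many bad clusters one chkdsk run reported adding (0 if none)."""
    match = BAD_CLUSTER_LINE.search(chkdsk_output)
    return int(match.group(1)) if match else 0


def is_making_progress(counts: list[int]) -> bool:
    """True if every run added strictly fewer bad clusters than the one before."""
    return all(later < earlier for earlier, later in zip(counts, counts[1:]))


if __name__ == "__main__":
    # Saved output of successive runs, oldest first (first and last run only).
    runs = [
        "Adding 2964 bad clusters to the Bad Clusters File.",
        "Adding 19 bad clusters to the Bad Clusters File.",
    ]
    counts = [bad_clusters_added(text) for text in runs]
    verdict = "making progress" if is_making_progress(counts) else "not improving"
    print(counts, verdict)
```

A count that stays flat or rises between runs is the second situation: stop scanning and copy off what you can.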

But my concern is whether the SSD with bad clusters is borked or not. If yes, I should try and use the warranty.

At my last job (a repair shop), about 10% of all new hard drives seemed to have a few bad clusters straight out of the factory, but those usually aren't a problem. After an initial chkdsk they go away and don't come back. Still, it seemed like 30% of the storage hardware I dealt with had an inconvenient number of errors within the first 3 years.

With regard to warranty policies, cluster errors over time are a grey area. If avoiding the cost of a new device is worth your time, look for the box, receipt, or order number, then call the warranty hotline and ask what the policy is.

One other thing to keep in mind is that some warranties might not cover you if you used the hard drive for such use cases as "server logging", "video surveillance", etc.

I'm also interested in general knowledge on the subject.

Bad clusters tend to appear on everything, sooner or later. When you first get a HDD/SSD/SD card, try to fill it up completely at least once, then delete everything, then whether or not you find bad data, run chkdsk on it. That will also help you discover which media are bogus or counterfeit (shows 1TB but is actually 784GB, funny stuff like that).
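
The fill-it-up test can be sketched roughly like this: write numbered blocks of reproducible data, then read them back and compare. This is only an illustration with made-up function names, not a replacement for a proper tool. One caveat: a read straight after a write may be served from the OS cache rather than the media, so it is better to verify after unplugging and re-inserting the drive (or rebooting).

```python
import hashlib
import os


def block_for(index: int, size: int) -> bytes:
    """Reproducible pseudo-random block derived only from its index."""
    seed = hashlib.sha256(index.to_bytes(8, "little")).digest()
    return hashlib.shake_256(seed).digest(size)


def fill(path: str, total_bytes: int, block_size: int = 1 << 20) -> int:
    """Write whole numbered blocks to `path`; return how many were written."""
    blocks = total_bytes // block_size
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(block_for(i, block_size))
        f.flush()
        os.fsync(f.fileno())  # push the data out of the OS write cache
    return blocks


def verify(path: str, blocks: int, block_size: int = 1 << 20) -> list[int]:
    """Return the indices of blocks that fail to read back as written."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            try:
                f.seek(i * block_size)  # re-seek so one failed read cannot misalign the rest
                data = f.read(block_size)
            except OSError:
                data = b""  # unreadable block counts as bad
            if data != block_for(i, block_size):
                bad.append(i)
    return bad
```

On a counterfeit drive the write usually appears to succeed, and it is the verify pass that fails from some block onward, which tells you roughly where the real capacity ends. Pick `total_bytes` slightly below the free space reported for the drive.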

Running chkdsk at that point will find most of the post-factory imperfections. Afterwards you can just use the drive normally. What you need to watch for is new bad clusters appearing after about 8+ months. Once that starts happening, keep track of how often they appear, and either make a warranty claim or switch the SSD to read-only data usage.
