Disks reports "Disk is OK, 5439488 bad sectors"

That seemed to me to be a helluva lot of bad sectors. This is a SATA M.2 SSD, but I thought those things took care of hiding bad sectors without the operating system having to bother its pretty little head about them. Ubuntu 20.04 seems to be able to count these bad sectors, yet still announce that the disk is "OK".

Is the disk "OK"? I'd been having mysterious error messages announcing that "Ubuntu 20.04 experienced an internal error" with the /var/crash report suggesting that the problem is (being detected by?) gnome-control-center. The system ran just fine following this error—until I rebooted. On two occasions, a reboot after this error failed completely, requiring a complete new install of 20.04.

Why does Disks declare a drive on which it is able to detect 5439488 bad sectors "OK"? I had assumed Disks was telling me "you've got an ageing SSD but it's all under control." But if the bad sector count is responsible for the reboot failures (my assumption, not fact), why is Disks seemingly giving the SSD a pass?

My initial working hypothesis was that the SSD was failing fast. An early reply to this post (which now seems to have disappeared) was certain that 5439488 bad sectors was a sure sign the drive needed replacing.

I now believe that to be wrong.

For one thing, the bad sector count is remaining stable at 5439488 even now, several days later. And my idea that overprovisioning (which takes care of bad sectors, a fact of life for SSDs) is something the SSD controller keeps invisible to the operating system appears to have been a misconception. The overprovisioning must be visible, because the capacity the drive publishes to the world is 256GB. Internal overprovisioning would, I believe, only offer 240GB.

My original question boiled down to this: does overprovisioning conceal bad sectors from the operating system until the overprovisioning runs out, in which case the 5439488 bad sectors would be overflow that is eating into usable capacity; or is the operating system in fact reporting every failed sector, including those taken care of by overprovisioning?

However, it's now clear to me that overprovisioning, probably handled by the SSD controller (am I right?), is being reported to SMART, and that Gnome Disks and GSmartControl must be reading this from SMART.

Two short tests and one extended test run with GSmartControl, BTW, all completed without error. Like Gnome Disks, GSmartControl reports the drive as being "OK".

By my reckoning, the current (stable) bad sector count amounts to around 2.8GB. An SSD that was secretly overprovisioning would be announcing 240GB, providing a reserve of around 16GB. We're well within that limit.
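For anyone checking my arithmetic, here's the back-of-envelope calculation. The 512-byte logical sector size is an assumption on my part, as is the hypothetical 240GB figure for a drive doing purely internal overprovisioning.

    # Back-of-envelope check, assuming 512-byte logical sectors (an assumption).
    bad_sectors = 5439488
    sector_size = 512                          # bytes, assumed
    print(bad_sectors * sector_size / 1e9)     # ~2.8 GB of "bad" capacity

    # Hypothetical reserve if a nominal 256GB drive were instead sold as 240GB:
    print(256 - 240)                           # ~16 GB of overprovisioning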

I started out with the assumption that there were connections between 1. Gnome Disks' bad sector count, 2. the "Ubuntu 20.04 experienced an internal error" message, and 3. the twice-experienced failure to boot.

But I may be quite wrong about this. The last Ubuntu internal error message was not followed by failure to boot. As I say, the bad sector count remains stable and the system seems to be running well.

The first draft of this post was originally deprecated by the mod as being opinion-based. I'm not sure what that means—yes, it is my considered opinion now after much experimentation and deliberation that the SSD in question is still in decent, usable nick and doesn't need replacing (and that the non-booting problem isn't connected).

The bottom line question here would then be: is this a fair assessment? What am I missing?

Secondary questions: am I right in assuming that an SSD that announces its full capacity is still handling bad sectors internally, but reporting them to SMART? Does an SSD sold as, e.g., 240GB handle that 16GB of overprovisioning internally without reporting it to SMART?

The answers are apparently not easy to come by on the Web. Can anyone here help?

-- Chris


Solution 1:

If you've got 5439488 bad sectors, I would replace the drive, as that's a lot of bad sectors. Back up your data and replace the drive; it has a big chance of failing soon.

Read up on what a bad sector is: https://www.howtogeek.com/173463/bad-sectors-explained-why-hard-drives-get-bad-sectors-and-what-you-can-do-about-it/.

Hope this helps. Jonathan Steadman.

Solution 2:

    >>> hex(5439488)
    '0x530000'

It's more likely that this number is a bit pattern. Many of the Raw Values listed by smartctl are bit patterns. How to interpret them usually depends on the manufacturer concerned.
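As a rough sketch of what that can look like in practice, here is the raw value split into bytes, the way some vendors pack several counters into a single 48-bit SMART raw value. The byte layout shown is purely illustrative; the real interpretation for this particular drive is unknown.

    # Purely illustrative: split the raw value into bytes. Some vendors pack
    # several counters into one 48-bit raw value; the actual layout is unknown.
    raw = 5439488
    print(hex(raw))                        # '0x530000'
    print(raw.to_bytes(6, 'big').hex())    # '000000530000'
    # All the "weight" sits in a single byte (0x53 = 83), with the two low
    # bytes zero -- which looks more like packed fields than a plain count.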

Solution 3:

A large number of bad sectors is not necessarily a problem. But if the number of bad sectors is increasing (especially on spinning rust), or you have run out of replacement sectors (on either mechanical drives or SSDs), failure may come soon. (Wear leveling is supposed to help with this, but it may make things worse if you are rewriting the majority of the disk frequently. You should use TRIM before doing a full disk rewrite to mitigate this.)

Remember also, SSDs have a limited number of write cycles per block; SSDs use wear leveling to try to give every block the same number of writes and so extend the life of the drive. If the SMART info lists it, this should be shown as Wear_Leveling_Count, and the number under the current (normalized) value is the percentage of life left. When this reaches zero, the drive will die, probably by no longer accepting writes.
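If you want to pull that attribute out programmatically rather than through the GUI tools, something along these lines should work. This is only a sketch: it assumes smartmontools is installed, that the drive is /dev/sda (substitute your own device), and that the vendor actually exposes an attribute named Wear_Leveling_Count, which not all drives do.

    # Sketch: print the normalized value of Wear_Leveling_Count, if present.
    # Assumes smartmontools is installed and /dev/sda is the drive in question.
    import subprocess

    out = subprocess.run(
        ["sudo", "smartctl", "-A", "/dev/sda"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in out.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "Wear_Leveling_Count":
            print("Wear_Leveling_Count (normalized, roughly % life left):", fields[3])
            break
    else:
        print("This drive does not list Wear_Leveling_Count under that name.")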