Any way to control hard disk bit rot on a Mac?

I have lots of files on my computer (about 20 TB). Much of this is stored on RAID 5 disks using SOFTRaid, and all of it is fully backed up on RAID 5 disks (4 sets, that rotate off-site).

I just had a disk fail, and I was reading up on RAID 5, and I found out that I didn’t actually understand the limitations when it comes to bit rot (not disk failure).

SoftRAID has a “Validation” pass that purports to fix any parity errors. What I just realized is that it works by looking at the data blocks (say, disks 1, 2, and 3 of a 4-disk RAID) and then fixes the parity block (disk 4 in this case) if it isn’t consistent. However, if a bit has flipped in one of the data blocks (a 75% chance when there is a parity mismatch, since three of the four disks hold data), this just “bakes in” the error, and the file is now compromised.

In fact, I had an 8-disk RAID 5 Time Machine drive (now in long-term storage) that I validated, and SoftRAID reported that it “fixed” 584 parity errors. By the same reasoning (now 1/8 vs. 7/8 for an 8-disk set), perhaps 73 of those changes created valid parity blocks for good files, while 511 created valid parity blocks for corrupted files. (Validating a disk is good for identifying bit rot, but can’t, in general, fix it.)

zfs and other file systems use checksums, which supposedly allow you to identify and repair bit rot.

Has anyone successfully set up a relatively cheap and low-maintenance system to allow one’s Mac to use a file system with checksums? I am only going to do this if it is easy. (For instance, maintaining a linux computer and keeping it up to date is not “easy.”) I would envision doing this for my data disks, while continuing to use APFS for my Time Machine backups (going with “easy”).

Any ideas?

I can’t believe it is working as you just described. Any remotely competent error-correction algorithm doesn’t just tell you that something is bad, but also lets you algorithmically compute the correct values, whether the error is in the data or in the error-correcting data.

If SoftRAID’s implementation of RAID 5 always assumes that the ECC data is wrong, then it’s completely useless. Given how popular this product is, and how many people (including Apple) have supported it over the decades, I have to believe that its algorithm isn’t this brain-dead.

What you believe and what is true can be different things. This is just the way that RAID 5 works. I don’t think it has anything to do with “ECC” data. The hard disk reports bits, and SoftRAID (like any RAID) calculates parity bits from them. If the disk data is corrupt, the parity will be recalculated to match the corrupt data.

I pulled out an old Time Machine disk to look for some files that I didn’t keep, and I got a predicted-failure warning on one disk. I thought I should fix all the parity errors before replacing that RAID slice. So far I have found two parity errors, almost certainly due to bit rot on the RAID slices of the Time Machine disk.

Note that bit rot on a Time Machine disk is not a game-stopping problem. What is the chance that the file you want to restore is one with bit flips? Very small.

But bit rot on your main disk is a serious problem. Why are you saving those files unless you want them in the future? And, unless your file system is doing checksums, there is no way to identify which files might be corrupt.

RAID doesn’t help solve the problem of bit rot. It helps you identify the presence of bit rot, but not which files might be corrupt (since it operates on a block basis, not a file-system basis).

My understanding is that zfs and other “modern” file systems do help protect you against bit rot. Unfortunately, Apple’s APFS does not.

If that’s the way RAID is supposed to work, then rebuilding a failed drive would be impossible.

David, it may be impossible to convince you of this here, but, yes, RAID works, and, yes, it doesn’t help with bit rot.

Let’s say one block on a RAID 5 4-disk set is:

Disk 1: AAAA
Disk 2: BBBB
Disk 3: CCCC

Then the software calculates the parity block:

Disk 4: DDDD

Then if any of the disks goes bad (stops working), I can just pull that disk out, and the RAID software will know, from the other three values, what value to put on the (new) disk. That’s how parity works.
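The parity math behind this is just a bytewise XOR across the data blocks. A minimal sketch in Python (toy 4-byte blocks, not SoftRAID’s actual code):

```python
from functools import reduce

def xor_blocks(blocks):
    """RAID 5 parity: XOR the corresponding bytes of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

disk1 = b"AAAA"
disk2 = b"BBBB"
disk3 = b"CCCC"
parity = xor_blocks([disk1, disk2, disk3])  # this is "disk 4"

# If disk 2 fails outright, XOR the survivors (including parity) to rebuild it,
# because XOR is its own inverse:
rebuilt = xor_blocks([disk1, disk3, parity])
assert rebuilt == disk2
```

The same reconstruction works for any single failed disk, including the parity disk itself.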

However, let’s say that a gamma ray comes in and flips one of these bits (actually more likely a weak magnetic field), so that now

Disk 1: AAEA

Now the parity doesn’t match, and if one disk fails, the values written back depend on which disk failed. If disk 1 fails (25% chance), it gets the original values back. But if disk 2 or disk 3 fails (50% chance), the parity plus the incorrect value on disk 1 will produce an incorrect value on that disk. If disk 4 fails, new parity information will be put on it (as below).

What SoftRAID’s validation step does is to look at the current values of disks 1, 2, and 3 and recalculate, if necessary, the parity on Disk 4. So it might change it to

Disk 4: DDFD

(Obviously, I am making up these values).

Now the 4 RAID slices are consistent, and if any one of the disks 1, 2, 3 or 4 fails, it can be reconstructed from the other three. But the reconstructed blocks will now be:

Disk 1: AAEA (corrupt data)
Disk 2: BBBB
Disk 3: CCCC
Disk 4: DDFD (parity, consistent with the corrupt data)

My file is corrupt, but the RAID is consistent. (Think of how false memories work: other memories are massaged so that the set of memories is internally consistent even though some are false.)
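The whole bake-in sequence can be demonstrated in a few lines, assuming the standard bytewise-XOR parity of RAID 5 (toy values, not SoftRAID’s actual code):

```python
from functools import reduce

def xor_blocks(blocks):
    """RAID 5 parity: bytewise XOR across the blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

disk1 = b"AAAA"
disk2 = b"BBBB"
disk3 = b"CCCC"
parity = xor_blocks([disk1, disk2, disk3])

# Bit rot flips a bit on disk 1 after the parity was written:
disk1 = b"AAEA"
assert xor_blocks([disk1, disk2, disk3]) != parity  # validation detects a mismatch

# "Fix" the mismatch by recomputing parity from the (corrupt) data blocks:
parity = xor_blocks([disk1, disk2, disk3])

# The array is internally consistent again...
assert xor_blocks([disk1, disk2, disk3]) == parity
# ...but rebuilding disk 1 from the others now faithfully reproduces the corruption:
assert xor_blocks([disk2, disk3, parity]) == b"AAEA"
```

Nothing in the parity alone says *which* block rotted, so recomputing parity is the only consistent “repair” available.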

If my file system were to write down additional data, like checksums, it could identify which block holds the flipped bit, and then use the redundancy already in the RAID to reconstruct the original values in the file. However, APFS doesn’t do this. zfs (among others) does.
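Conceptually, that is how a checksumming file system breaks the tie: the checksum says which block is wrong, and parity supplies the correct bytes. A toy sketch (this is not zfs’s actual code; I’m using SHA-256 as the checksum and XOR parity for illustration):

```python
import hashlib
from functools import reduce

def xor_blocks(blocks):
    """RAID 5 parity: bytewise XOR across the blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def checksum(block):
    return hashlib.sha256(block).digest()

data = [b"AAAA", b"BBBB", b"CCCC"]
sums = [checksum(b) for b in data]   # stored separately by the file system
parity = xor_blocks(data)

data[0] = b"AAEA"                    # bit rot on disk 1

# The checksums identify exactly which block rotted...
bad = [i for i, b in enumerate(data) if checksum(b) != sums[i]]
assert bad == [0]

# ...so parity can reconstruct the good copy instead of baking in the error:
survivors = [b for i, b in enumerate(data) if i != bad[0]]
data[bad[0]] = xor_blocks(survivors + [parity])
assert data == [b"AAAA", b"BBBB", b"CCCC"]
```

With only parity (no checksums), the “bad block” list above cannot be computed, which is exactly the gap in the RAID-only setup.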

Can we get back to my original question: does anyone have ideas about how to store terabytes of files on a “modern” file system with checksums and error correction that is (a) usable by macOS, and (b) not a maintenance nightmare?

Cheers