Storage resiliency: Better distributed or centralized?

Um, why put it all in one basket? Distributed risk an’ all. 2TB is much cheaper than 4. Use USB 3 for Time Machine and Thunderbolt for the VMs.

:blush:

Dave

I have never bought this argument.

If you need 4TB of storage and use 2x2TB instead of 1x4TB, you have two drives, so you are twice as likely to experience a failure. Yes, each failure will only lose you 2TB instead of 4TB, but since all data should be backed up at least twice anyway, all you gain by using smaller disks is doing smaller restores, twice as often.
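
A back-of-the-envelope sketch of that arithmetic in Python, assuming independent drives and a made-up 3% annual failure probability (the exact figure is an assumption; only the comparison matters):

```python
# Hypothetical per-drive annual failure probability, for illustration only.
p_fail = 0.03

one_big_drive = p_fail                   # 1x4TB: one chance to fail
either_of_two = 1 - (1 - p_fail) ** 2    # 2x2TB: at least one of two fails

print(f"1x4TB: {one_big_drive:.4f}")     # 0.0300
print(f"2x2TB: {either_of_two:.4f}")     # 0.0591 - just under twice as likely
```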

This argument has been around as long as computers have. The last big one was the mainframe vs. workstation debate. A lesser current one is cloud vs. local systems. A bigger future one is the energy grid: massive nuclear power stations vs. thousands of local energy sources.

The arguments are always couched as either/or but if you look into things they always end up being more nuanced.

I’ve always been partial to the thinking in E.F. Schumacher’s Small Is Beautiful. Even though they can be more expensive, small(er) coordinated independent systems tend to be more robust than large, monolithic ones in the long run.

In Adam’s case, I thought it made more sense to use two disks, each targeted to a purpose. Smaller disks have been around longer, so they’re more stable. Time Machine does not require a high-speed interface; in fact, you could get away with USB 2, honestly. The VMs would certainly like TB5, and so would the user. So why not physically separate them? Save a little money, perhaps? And if you have to restore them from backup, 2TB is going to be, er, 2 times faster than 4. :slightly_smiling_face:

Dave

P.S. A long time ago I was designing a publishing system using desktop workstations for Land’s End (the clothing merchants) at their Wisconsin headquarters. Their IT director asked if I’d like to see their main systems. Well, sure! We’re standing in a second-floor conference room overlooking a massive IBM 360/70 installation that handled all of their business dealings. So, said I, your backup systems must be monstrous to protect real-time ordering. Oh, he said, there is an absolutely identical 360/70 installation up in Canada that mirrors this one. Gasp.

5 Likes

The arguments are always couched as either/or but if you look into things they always end up being more nuanced.

Yes, in any particular case other factors may affect the choice (like wanting speed for VM content but not for backup), but the core principle is the same. In pre-SSD days, restoring 2TB of data could take long enough that it felt worth going small, even if it meant doing it more often.

Your reasoning about smaller coordinated systems has merit, but not in the case of 2x2TB smaller disks configured as a single volume (RAID-0 striping or simple concatenation) vs. a 4TB drive. The key word there is “coordination”. In the two-drive concatenated configuration, there’s no coordination that keeps a single failure from ruining your day. You’d need a higher RAID level (1, 5, 6, etc.) to mitigate the failure risks inherent in multiple coordinated drives, and maintaining that availability to offset the lower MTBF of a multi-drive configuration has costs in both disk capacity and performance.
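
To put rough numbers on the coordination point, here’s a small sketch under the same assumptions as above (independent drives, a made-up 3% annual failure probability), contrasting an uncoordinated two-drive concatenation with a RAID-1 mirror:

```python
p = 0.03  # assumed annual failure probability per drive

# 2-drive concatenation/RAID-0: any single failure loses the whole volume.
p_concat = 1 - (1 - p) ** 2

# 2-drive RAID-1 mirror: both drives must fail to lose the volume
# (ignoring the risk window while a failed drive is being rebuilt).
p_mirror = p ** 2

print(f"concatenated: {p_concat:.4f}")   # ~0.0591
print(f"mirrored:     {p_mirror:.4f}")   # 0.0009 - but at half the capacity
```

The mirror buys that resilience at the cost of half the usable capacity, which is exactly the overhead trade-off mentioned above.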

Two independent drives to get 4TB is safer if you can logically separate and manage your workloads (e.g., Time Machine on one drive and virtual machines on the other) within the capacity constraints of a single drive.

I thought I read somewhere that higher-capacity SSDs may outperform lower-capacity ones; something about the chip layout (more flash chips to access in parallel) comes to mind.

Afraid I disagree, as I said above. If you have a proper backup strategy, 1x4 is just as safe as 2x2. All having 2x2 achieves is reducing the size of the restore, which you may need to do twice as often because you have twice as many drives.

Backup is table stakes, of course. My point is that depending on your tolerance for workflow disruption, storage resiliency needs to be taken into consideration - especially if you’ve chosen to build a single large storage volume out of multiple drives.

I do agree that single drives don’t introduce resiliency issues at the file system level. And you don’t have to worry as much about resiliency if you can partition your workload across multiple smaller drives. If you lose a drive, recovery may not be too painful (it’s easier to recover 1TB than 4TB, for example).

But if you need a single volume/file system with a large capacity, you have two choices: a single high-capacity drive, or a storage pool built on multiple smaller drives. The recovery effort is the same should you lose the storage device/pool. But you have a higher probability of having a problem that will impact the multi-drive pool. The MTBF of an n-drive storage pool is 1/n of the MTBF of a single drive - or, looking at it another way, you’re n times more likely to have a problem with an n-way storage pool than with a single large drive. That’s why multi-drive storage pools (usually 3 or more drives) are typically set up with some kind of redundancy (e.g., RAID levels) to survive the loss of a drive (or 2, depending on the RAID level chosen).
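
That 1/n relationship is easy to check with a simulation. A minimal sketch, assuming (simplistically) exponentially distributed drive lifetimes and a made-up 1,000,000-hour per-drive MTBF:

```python
import random

DRIVE_MTBF_HOURS = 1_000_000  # assumed per-drive MTBF, for illustration
TRIALS = 100_000

def pool_mtbf(n_drives: int) -> float:
    """Estimate mean time to first failure of an n-drive pool
    (the pool is considered dead as soon as any one drive dies)."""
    total = 0.0
    for _ in range(TRIALS):
        total += min(random.expovariate(1 / DRIVE_MTBF_HOURS)
                     for _ in range(n_drives))
    return total / TRIALS

for n in (1, 2, 4):
    print(f"{n}-drive pool: ~{pool_mtbf(n):,.0f} hours")
# Prints roughly 1,000,000 / n hours for each n, matching the 1/n rule.
```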

I will also admit that there may be performance advantages to a multi-drive storage pool configuration as opposed to a single large drive.

You can bet that the cloud vendors are doing something similar in the way they store data across multiple drives to provide resiliency for their cloud storage, or that any peer-to-peer storage configuration is doing the same (writing redundant data so that failures can be mitigated). Even blockchains write data to multiple places so that it doesn’t get lost from a single failure.

Yes, backup can recover even a large storage pool. But you don’t want to have to do it if you can avoid it - that’s disruptive to workflow even with the fastest media you can get for both storage and backup. IMO, if you’re recovering from backup frequently, you’ve got something wrong with what you’re doing.

To pick a nit, the clothing merchant is Lands’ End.

Or maybe it’s not a nit. Branding is important to companies.

1 Like

As noted in the original RAID paper (https://web.mit.edu/6.033/2015/wwwdocs/papers/Patterson88.pdf), arrays that can recover from failed disks overcome the more frequent failures that come with using many drives, compared with the single large drives of the day (then associated with mainframes).

Absolutely, and agreed. The redundant information written by RAID and other kinds of distributed storage architectures is what provides the resilience needed to overcome the failure of one (or maybe more) of the storage components. Note that in all of these configurations there are trade-offs (overhead) in terms of cost, space, and/or performance to gain that increased capacity with resilience.

For example, a RAID-5 array will experience degraded performance in the event of a disk failure, even though the storage will still be available. While the drive is missing, its data will be synthesized on access from the parity information written across the surviving drives. Performance will also be degraded during the rebuild process that occurs when the failed drive is replaced or repaired.
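
For anyone curious what “synthesized from information written to the surviving drives” looks like, here’s a toy sketch of the XOR parity arithmetic RAID-5 uses for one stripe (real arrays rotate parity across drives and operate on whole stripes of blocks):

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# One stripe on a 4-drive RAID-5: three data blocks plus one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)        # written when the stripe is written

# The drive holding d1 fails: its block is rebuilt from the survivors.
recovered = xor_blocks(d0, d2, parity)
assert recovered == d1                 # XOR parity recovers the lost data
```

Having to do that XOR on every read of the missing drive’s data is where the degraded-mode performance hit comes from.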

1 Like

This is the key point when thinking about RAIDs. They do, in fact, add complexity, and when an individual drive fails, the rebuild process can take hours (the last one I encountered, long ago, took 12 nervous hours).

RAIDs are indispensable for servers handling high-frequency transactions, like SQL databases with tens, hundreds, or thousands of clients. They are equally good for high-speed, high-volume reads and writes for things like high-def video (and that configuration is not redundant).

But for normal file services they’re overkill. Get a good backup system, and if a drive fails you can restore from last night’s backup; for most people it’s a trivial amount of time to bring things current from there. If you’re using Time Machine, or back up on a similar schedule, it’s at most an hour’s work that has to be redone.

RAIDs are useful but I now think they have very limited application given their cost.

Dave

Different solutions for different problems.

RAID provides high capacity and resiliency. But it is not a backup solution. If a file gets deleted or corrupted, that’s it.

Neither are snapshots - they protect against certain kinds of failures (e.g. accidental erasure or modification of files), but don’t do a thing to protect against hardware failures.

And backups don’t provide resiliency. It takes time (sometimes hours or days) to restore from a backup.

Which is why IT departments use some combination of all three. They use RAID to get high capacity and resiliency. The servers are often mirrored for geographic diversity. Snapshots are often employed for things like self-service data recovery. And backups are made to protect against catastrophic failures.

What you need for your home or small office will depend entirely on what you’re doing, how much data is involved, and how much your time is worth, compared to equipment costs.

2 Likes