AppBITS: Hyperspace Reclaims Space Used by Identical Files

ace · March 19, 2025, 7:08pm

Originally published at: https://tidbits.com/2025/03/19/appbits-hyperspace-reclaims-space-used-by-identical-files/

Is it weird that what I like most about the new space-reclamation app Hyperspace is its documentation? That’s because Hyperspace comes from John Siracusa, perhaps best known for his exhaustive reviews of Mac OS X for Ars Technica. Even if you have no need for Hyperspace, reading its FAQ will teach you a lot about one of the core features of the standard macOS APFS filesystem.

Unlike apps such as MacPaw’s Gemini, which search for and delete duplicates—often with some degree of intentional fuzziness—Hyperspace scans your drive for identical files and turns all but one into space-saving clones that occupy virtually no space. For example, if you have three identical copies of a 2 GB file, Hyperspace keeps one copy and converts the other two into space-saving clones that initially take up no additional space. If you later modify one of the clones, only then will it grow to the file’s full size.

Space-saving clones have been with us since the dawn of APFS, and you create one every time you use the Finder to duplicate a file. Unlike symlinks and hard links, which are ways of making the same file appear in multiple locations, space-saving clones are regular files. Changes made to one do not affect any others, although a changed clone immediately occupies all the space it didn’t before the change.

Another notable aspect of Hyperspace is its business model. The app is free to download from the Mac App Store, and you can scan your drive to see how much space it can reclaim. Only if you want to reclaim space do you need to pay. Your choices are $9.99 to use it for a month, $19.99 for a year, or $49.99 for lifetime access. These are one-time purchases, though you can instead opt for a subscription at $9.99 per month or $19.99 per year—the only win there would seem to be not having to remember to purchase again.

When I ran Hyperspace, it identified just 3.24 GB of potential savings in the 900 GB of data on my 1 TB drive. I am curious about which files on my drive are duplicates, but Hyperspace reserves that information for those who pay—sensibly enough, since you could manually create Finder duplicates to replicate its functionality. I’m not that curious, though I do wonder how I ended up with over 11,000 duplicate files created in some way other than through the Finder.

mpainesyd · March 19, 2025, 8:18pm

Looks like a useful app - thank you for the review.

Is a “clone” similar to an alias? I regularly use aliases when creating a folder of references for new projects (option-cmd drag from the original file to the destination folder).

Update: It reclaimed 8 GB for me. Some duplicated files (mostly videos) go back at least a decade - through several Mac migrations.

ddmiller · March 19, 2025, 8:41pm

Not really - aliases will always reference the original file, even when it changes.

A clone is a way to save space when you make an identical copy of a file to a different location on the disk. But what’s cool about this is if the original file changes, or the copy, the clone is automatically de-referenced and the file system will create separate copies. The clone exists only as long as the two files are not updated.

It’s one of the reasons why copying a/some file(s) on an APFS disk is so fast compared with HFS+ - the file system just registers the clones.

Will_B · March 19, 2025, 11:17pm

As you say, the docs alone almost sound worth the price. APFS has seemed to have a lot of voodoo that I have been uneasy about.

Shamino · March 20, 2025, 2:31pm

Copy-on-write semantics is not new or unique to APFS. Server-class file systems have offered the feature for a very long time, typically for the purpose of implementing snapshots.

The idea is that every disk block has a “reference count” associated with it, representing how many files are using that block.

When you make a new snapshot, the new snapshot simply increases the reference count of all the disk blocks that are in-use (there are some optimizations to speed this up), which is why it doesn’t increase the amount of storage consumed by any significant amount.

When you later write to a file that was snapshotted, the copy-on-write semantics cause the overwritten disk blocks to be duplicated. New blocks are allocated for the newly-written file and the reference counts for the original file’s blocks are decremented.

When you delete a file, its blocks’ reference counts are decremented.

When a snapshot is deleted, the reference counts for all of the snapshot’s blocks are decremented.

When a block’s reference count goes to 0, it means no file in any snapshot is using it, and the block is released to the pool of free blocks.

File cloning is the exact same thing, but on a per-file basis instead of a per-file-system basis. When you make a clone, it creates a new directory entry in the file system, but that entry references the same disk blocks as the original, incrementing the reference count for all of its blocks. Copy-on-write semantics will cause blocks to be cloned only when they are overwritten.

A really good description of the (snapshot) concept is in a 2011 white paper from Network Appliance (maker of high-end file servers):

NetApp Snapshot Technology

APFS does something conceptually similar.
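The reference-counting scheme described above can be sketched as a toy model. This is only an illustration of the general copy-on-write idea, not how APFS or any NetApp product is actually implemented; the `BlockStore` class and all of its method names are made up for this example.

```python
from collections import Counter


class BlockStore:
    """Toy copy-on-write store with per-block reference counts (hypothetical)."""

    def __init__(self):
        self.data = {}         # block id -> bytes stored in that block
        self.refs = Counter()  # block id -> number of files using the block
        self.files = {}        # file name -> ordered list of block ids
        self._next_id = 0

    def _alloc(self, payload):
        block_id = self._next_id
        self._next_id += 1
        self.data[block_id] = payload
        self.refs[block_id] = 1
        return block_id

    def _release(self, block_id):
        self.refs[block_id] -= 1
        if self.refs[block_id] == 0:
            # Nothing uses this block any more: return it to the free pool.
            del self.refs[block_id]
            del self.data[block_id]

    def create(self, name, payloads):
        self.files[name] = [self._alloc(p) for p in payloads]

    def clone(self, src, dst):
        # A clone is a new directory entry that shares the source's blocks.
        self.files[dst] = list(self.files[src])
        for block_id in self.files[dst]:
            self.refs[block_id] += 1

    def write(self, name, index, payload):
        # Copy-on-write: the written file gets a fresh block and the old
        # block loses one reference. (This toy always allocates, even when
        # the old block had a single user; the net block count is the same.)
        old = self.files[name][index]
        self.files[name][index] = self._alloc(payload)
        self._release(old)

    def delete(self, name):
        for block_id in self.files.pop(name):
            self._release(block_id)

    def blocks_in_use(self):
        return len(self.data)


store = BlockStore()
store.create("a", [b"one", b"two", b"three"])
store.clone("a", "b")
print(store.blocks_in_use())  # 3: the clone shares all three blocks
store.write("b", 1, b"TWO")
print(store.blocks_in_use())  # 4: only the overwritten block was duplicated
store.delete("a")
print(store.blocks_in_use())  # 3: the block only "a" still used is freed
```

Note that neither file is treated as the "original" here: both names simply point at the same blocks until one of them is written to.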

nextstep · March 20, 2025, 11:46pm

Adam writes:

Space-saving clones have been with us since the dawn of APFS, and you create one every time you use the Finder to duplicate a file.

So, I freshly format a drive as APFS. I create files on that drive. Every duplicate I make of those files is actually not a duplicate. By default, it is a clone. So, in this scenario, where do the “duplicates” that Hyperspace finds come from? There are no non-clone duplicates to be found on an APFS volume, right?

blm · March 21, 2025, 1:50am

There aren’t if you only duplicate files in the Finder (or via cp -c from the command line). However, say you download the same file twice. I don’t think any browser is smart enough to clone the first download. Same thing if you copy a file off of an external volume then copy it again. Or you save a file in some app then save it again somewhere else. In those cases you’ll end up with files with identical data both taking up their full size. Those are the kinds of things Hyperspace will find and (with the paid version), delete one and clone the other.
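For anyone curious how a scan for identical files might work in principle, here is a minimal sketch. It is not Hyperspace's actual method (the FAQ is the place to look for that); it just groups files by size and then by SHA-256 digest. The function names are invented for the example, and it makes no attempt to tell whether two identical files are already clones of each other.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_identical_files(root):
    """Return lists of paths under root whose contents are identical."""
    # Pass 1: files of different sizes cannot be identical, so bucket by size.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and anything that is not a regular file
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with at least one other.
    groups = []
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # empty files save nothing; unique sizes have no twin
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling `find_identical_files(os.path.expanduser("~/Downloads"))` would list the kind of twice-downloaded files mentioned above. A real tool would need to be far more careful (permissions, files changing mid-scan, metadata, and so on) before replacing anything.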

Dafuki · March 21, 2025, 1:55am

Well, yes, there are or could be. Let’s say you copy twenty images to an external volume and then 3 months later you copy them back to a different directory forgetting you already had them on the drive. Those are duplicates and not clones. Hyperspace will find them and convert them to clones; thereby saving the space.

Dave

mschmitt · March 21, 2025, 3:50am

Every application’s files are separate, unless you duplicate the application. But many of the contained files could be duplicates. My Mac has 80 copies of the Sparkle framework, multiple copies of Electron, and so on.

There are also just badly designed apps. For example, iMazing downloads .PNG files for each app icon, and saves them separately for every version of the application. It never purges them! So I only have 261 apps in the iMazing library, but there are 17,359 icon files. It appears that most of them are exact duplicates.

nextstep · March 21, 2025, 5:11am

Thanks, guys, for excellent examples to answer my question.

At the risk of going off into the (hyper)weeds here, I have a followup.

If you later modify one of the clones, only then will it grow to the file’s full size.

Thus, it appears to make sense never to update a clone; always to update the original, preserved file. But, after Hyperspace has worked its magic, is there any way to tell which file in the processed set is now the original?

…if the original file changes, or the copy, the clone is automatically de-referenced and the file system will create separate copies. The clone exists only as long as the two files are not updated.

OK. Whenever I update the original, which is now the most up-to-date version of the document, the clones disappear. Since this is certain to happen at some point, why not just locate and delete the duplicates in the first place, with something like TidyUp?

ddmiller · March 21, 2025, 10:50am

Fwiw by default Hyperspace ignores application bundles and does not look for duplicates there. You can force it, but I believe John Siracusa considers that risky (based on the FAQ, linked in Adam’s article).

There’s no difference in effect. Once either file is modified, APFS knows they are not the same and de-references the clone from the source, forcing it to write out the other. What I am not sure of is what happens if you overwrite the source which has many clones. It may be that all of the clones are de-referenced and written out; it may be that one of the unmodified clones becomes a new source. (That may be in the FAQ if you want to read it.)

gingerbeardman · March 21, 2025, 12:31pm

Requires macOS 15.0 or later.

I find this both surprising and disappointing given how easy I found supporting macOS 12 and up for Stapler (AppBits).

blm · March 21, 2025, 5:05pm

I think it’s best not to think of “original” and “clone”. When you duplicate a file in the Finder (or other ways that do a clone vs a copy), you now have two files that reference the same data (so take up 1/2 the space of actual copies). If you duplicate either of those files then you have three files that reference the same data (taking up 1/3 the space), etc. If you then rewrite any one of the n files that reference the data, that file is given a new blob of data that only it references and the number of files referencing the data it used to reference is decremented (once nothing references it it’s freed).
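The arithmetic can be written down as a small sketch. It assumes the simplest case, where a rewritten file has its entire contents replaced (a partial write would only duplicate the changed blocks), and the `space_used` function is just a name made up for this example.

```python
def space_used(file_size, total_files, rewritten=0):
    """Storage used by a set of files that all started as clones of one another.

    file_size   -- size of one file (any unit)
    total_files -- how many files reference the data
    rewritten   -- how many of them have since been completely rewritten
    """
    if not 0 <= rewritten <= total_files:
        raise ValueError("rewritten must be between 0 and total_files")
    # The shared data exists once for as long as any file still references it.
    shared = file_size if rewritten < total_files else 0
    # Each rewritten file now holds data that only it references.
    return shared + rewritten * file_size


# Adam's example: three identical copies of a 2 GB file.
print(space_used(2, 3))               # 2 GB instead of 6 GB as plain copies
print(space_used(2, 3, rewritten=1))  # 4 GB once one of them is rewritten
```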

Shamino · March 21, 2025, 5:20pm

Absolutely yes. A clone is functionally no different from a copy, except for the fact that it will not consume any additional storage until it is modified. There is no concept of “original” vs. “clone”. The two files are 100% equivalent.

That is to say, a clone is not an alias.

Diletante · March 22, 2025, 11:03am

I have been playing with Hyperspace and the documentation really is fabulous.

I started cautiously with some smaller folders but Hyperspace is so fast on my MacBook Pro, M3 Pro that I eventually just chose my entire user directory and it scanned it in under a minute. I have about 700GB on my 1TB internal SSD.

Initial savings were <2GB using the default settings but the trick is to add file types that are specific to your domain. When I added FileMaker and Vectorworks files, some of the largest, non-media files on my system, Hyperspace was able to find an additional 6GB of savings. I’ll keep sniffing around and see what else might be included.

The great thing about this is that I want those duplicated files where they are because of their associations with other files in a suite but I get to have my cake and eat it as I don’t have to waste space on them, back them up, etc. If they diverge later because of my edits, they are there.

Well done John Siracusa!

jdakq · March 25, 2025, 2:13am

Well, I would have liked to have tested it - but it says it requires macOS 15+, and I’m at 14.x (Ventura) at the moment due to the age of my iMac even though my main SSD is APFS. Alas.

tidbits22 · March 25, 2025, 3:39am

How does it work with iCloud?

I am confused by the fact that on my MBP many of the files are in “~/Library/Mobile Documents/com~apple~CloudDocs/Documents/”

tidbits22 · March 25, 2025, 3:41am

Are you saying Hyperspace doesn’t look at those files by default? Why not?

mac · March 25, 2025, 5:49am

Definitely a cool program and great documentation. I tried it on my internal drive, my external “big data” drive and my wife’s internal drive after I read about it on Daring Fireball. It only found one percent or so of duplicate data. Sadly, I didn’t find that enough to warrant paying. I loved learning about APFS clones, though.

Diletante · March 25, 2025, 8:38am

Hyperspace is by default very conservative, which is a good thing for this kind of tool which is operating on your data. In the settings you can either add individual file types or set it to do everything once you are content it is doing what you expect and your tests show that no harm is being done.

I appreciate this approach because caution is always advisable when a tool like this might encounter a file type never seen by the author. He does some analysis on selected files and always errs on the safe side.