Do Pages files contain "fluff"?

When I receive .docx files from clients I open them in Pages, and this works surprisingly well. I can even send .docx files back to the original author if needs must.

I recently noticed that files grow massively when I do this.
A .docx file of just 18 KB swells to 826 KB once I make a Pages file of it.

Is this normal? And if so, why?

While I can’t speak directly to Word → Pages, I am familiar with proprietary format conversions. Typically, the owner of the file format does not publish a “how to read my files and write to them” document. Usually one or more third parties use brute-force reverse engineering (this was the file we started with, we look at the binary, take our best guess at what ended up where, and write conversion code). The third party then creates a “generic” or “neutral” model of the proprietary format.

The third party then sells a set of libraries with the generic/neutral model to folks like Apple. If Apple wants to read and write to the proprietary format, it probably includes those libraries, or at least the parts of the model that it used, thus increasing the size.

My best guess with just a little insider knowledge.

I think in the case of the Office file formats, they are an ‘open’ standard of sorts, which anyone can implement. So not sure that is the cause of the file size increase in this case.

I don’t know why, but Pages seems to create large files no matter what. I just created a new blank document and wrote two words in it. The resulting file is a bit over 100kB!

The default Office file format (.docx, .xlsx) is actually a zipped folder containing XML. Since it has been compressed, it will generally be smaller than another representation of the same data. (Judging by the file sizes people are reporting from Pages, those files are not compressed.)

Note that you can look inside the Office file by just changing the extension to .zip and running it through a decompression utility. It often works to extract original image files that are in an Office document.
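For anyone who would rather not rename files, the same peek inside can be sketched in Python with the standard zipfile module. This is only a rough sketch: the function names list_archive_entries and extract_media are made up for illustration, and the word/media/ folder is an assumption about where Word usually keeps embedded images, so a given file may differ.

```python
import zipfile
from pathlib import Path


def list_archive_entries(path):
    """Return (name, compressed_size, uncompressed_size) for each entry in a zip-based file."""
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.compress_size, info.file_size)
                for info in archive.infolist()]


def extract_media(path, out_dir):
    """Copy entries found under word/media/ into out_dir.

    Assumption: embedded images live under word/media/, which is the usual
    layout for .docx files but is not guaranteed for every document.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            if info.filename.startswith("word/media/") and not info.is_dir():
                # Keep only the file name so everything lands flat in out_dir.
                target = out_dir / Path(info.filename).name
                target.write_bytes(archive.read(info))
                extracted.append(target)
    return extracted
```

Calling list_archive_entries("example.docx") (a hypothetical file name) should show how much smaller each XML part is once compressed, and extract_media("example.docx", "images") should pull out any pictures it finds.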

Dave

Interesting approach, although this didn’t work when I tried it. What does work to get images out is to convert to HTML. A topic for another day.

To look inside, I just opened both a .pages and a .docx file with BBEdit.


Yes @jzw, the famous MSO documentation about their file formats. Close to 8,000 pages long :sunglasses:

Pages docs usually contain a PDF or PNG of the actual document, or at least its first page. In the Finder this gets used to create an icon showing doc content along with a preview of the actual contents. I would imagine that storing graphics of a rendered document along with said document would result in a much larger overall file size than just a bunch of compressed XML that describes the render without containing any actual render (Word).

Interesting observation. I just created a new Word doc with the content “two words”. The saved file size is 26kB. Printed it to a PDF from Word - that is 16kB. Opened it with Pages and saved it as a pages document: 610kB!
I changed the extension from pages to txt and opened it with TextEdit and there was tons of unreadable “fluff” in there. That must be some of the conversion overhead described above. Perhaps also theme and font definitions?
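Rather than reading the raw bytes in a text editor, a per-entry size breakdown might show where the "fluff" actually sits. The sketch below is hedged: it assumes the .pages file is either a zip archive or a package folder, which may not hold for every Pages version, and size_report is a made-up name. If the file is neither, it simply returns None.

```python
import os
import zipfile


def size_report(path):
    """Return [(entry_name, size_in_bytes)] sorted largest first.

    Handles two container shapes (both are assumptions about the file):
    a package folder, or a single zip archive. Returns None otherwise.
    """
    if os.path.isdir(path):
        sizes = []
        for root, _dirs, files in os.walk(path):
            for name in files:
                full = os.path.join(root, name)
                sizes.append((os.path.relpath(full, path), os.path.getsize(full)))
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            # compress_size is the space each entry takes up inside the archive.
            sizes = [(info.filename, info.compress_size)
                     for info in archive.infolist() if not info.is_dir()]
    else:
        return None
    return sorted(sizes, key=lambda item: item[1], reverse=True)
```

Running size_report on both the .docx and the .pages version of the same document and comparing the top few entries should make it clearer whether previews, themes, or something else accounts for the difference.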