When I receive .docx files from clients I open them in PAGES and this works surprisingly well. I can even send .docx files back to the original author if needs must.
I recently noticed that files grow massively when I do this.
A .docx file of just 18 KB swells to 826 KB once I make a PAGES file of it.
While I can't speak directly to Word to Pages conversion, I am familiar with proprietary format conversions. Typically, the owner of the file format does not publish a "how to read my files and write to them" document. There are typically one or more third parties that use brute-force reverse engineering (this was the file we started with, we look at the binary, take our best guess at what ended up where, and write conversion code). The third party then creates a "generic" or "neutral" model of the proprietary format.
The third party then sells a set of libraries with the generic/neutral model to folks like Apple. If Apple wants to read and write to the proprietary format, it probably includes those libraries, or at least the parts of the model that it uses, thus increasing the size.
My best guess with just a little insider knowledge.
I think in the case of the Office file formats, they are an "open" standard of sorts, which anyone can implement. So I'm not sure that is the cause of the file size increase in this case.
I don't know why, but Pages seems to create large files no matter what. I just created a new blank document and wrote two words in it. The resulting file is a bit over 100 kB!
The default Office file format (.docx, .xlsx) is actually a zipped folder containing XML. Since it's been compressed, it will generally be smaller than another representation of the same data. (Judging by the file sizes people are reporting from Pages, those files are not compressed.)
Note that you can look inside the Office file by just changing the extension to .zip and running it through a decompression utility. It often works to extract original image files that are in an Office document.
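The same inspection can be done without renaming anything. Here is a minimal Python sketch that lists what is inside an Office file and how well each part compresses. The function name `list_archive_members` is made up for illustration; it relies only on the standard `zipfile` module.

```python
import zipfile


def list_archive_members(path):
    """Return (name, compressed_size, uncompressed_size) for each zip member.

    Works on any zip-based file, which includes .docx and .xlsx.
    """
    with zipfile.ZipFile(path) as archive:
        return [
            (info.filename, info.compress_size, info.file_size)
            for info in archive.infolist()
        ]
```

Calling `list_archive_members("report.docx")` (a hypothetical file name) should show the XML parts, such as the main document body, alongside any embedded images, so you can see which parts account for the file size.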
Pages docs usually contain a PDF or PNG of the actual document, or at least of its first page. The Finder uses this to create an icon showing the doc content, along with a preview of the actual contents. I would imagine that storing graphics of a rendered document along with said document would result in a much larger overall file size than just a bunch of compressed XML that describes the render without containing any actual render (Word).
Interesting observation. I just created a new Word doc with the content "two words". The saved file size is 26 kB. Printed it to a PDF from Word: that is 16 kB. Opened it with Pages and saved it as a Pages document: 610 kB!
I changed the extension from pages to txt and opened it with TextEdit, and there was tons of unreadable "fluff" in there. That must be some of the conversion overhead described above. Perhaps also theme and font definitions?
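Rather than reading the file as text, it may be more telling to list its largest parts, which would also help check the preview-image theory mentioned earlier in the thread. Below is a rough Python sketch under one loud assumption: that the .pages document is either a zip archive or a folder-style package. Which of those applies (if either) depends on the Pages version, so the code checks and gives up otherwise. The function name `largest_parts` is hypothetical.

```python
import os
import zipfile


def largest_parts(path, top=10):
    """Return the `top` largest (size_in_bytes, name) entries inside a document.

    Assumption: the document is a zip archive or a folder-style package.
    """
    if os.path.isdir(path):
        # Folder-style package: walk it and record each file's size on disk.
        parts = []
        for root, _dirs, files in os.walk(path):
            for name in files:
                full = os.path.join(root, name)
                parts.append((os.path.getsize(full), os.path.relpath(full, path)))
    elif zipfile.is_zipfile(path):
        # Zip archive: use the compressed size, since that is what the
        # member actually contributes to the file on disk.
        with zipfile.ZipFile(path) as archive:
            parts = [(info.compress_size, info.filename) for info in archive.infolist()]
    else:
        raise ValueError(f"{path!r} is neither a zip archive nor a folder")
    # Largest entries first.
    return sorted(parts, reverse=True)[:top]
```

If image or PDF entries dominate the top of the list, the bundled preview is a likely culprit; if not, the overhead is somewhere else, such as theme or style data.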