XML-based MS Office documents are really renamed zip files


I had heard long ago that modern MS Office files can be renamed with a .zip extension and then extracted, and I recently tried it. The result is a folder containing the files that make up the document: XML parts plus any embedded media such as images.

This is particularly handy for PowerPoint files, which tend to contain lots of images, but it works for Word and other Office files too.
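If the images are all you are after, here is a minimal Python sketch that copies only the media files out of the archive. It assumes the usual layout in which embedded media sit in a folder named media (for example ppt/media/ in a .pptx or word/media/ in a .docx). The function name extract_media and the output folder are made up for illustration.

```python
import zipfile
from pathlib import Path


def extract_media(office_file, dest_dir):
    """Copy every file stored under a */media/ folder of the archive into dest_dir.

    Returns the sorted list of file names that were written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(office_file) as archive:
        for info in archive.infolist():
            parts = info.filename.split("/")
            # Assumed layout: ppt/media/..., word/media/..., xl/media/...
            if info.is_dir() or len(parts) < 3 or parts[-2] != "media":
                continue
            target = dest / parts[-1]
            target.write_bytes(archive.read(info))
            written.append(target.name)
    return sorted(written)
```

For example, extract_media("slides.pptx", "slides-images") would leave just the pictures in slides-images, without the XML around them.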

The process, version 1:

  • make a copy of the file,
  • rename the extension to .zip,
  • extract the .zip with the tool of your choice,
  • then take a look at the data!
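The steps above can also be scripted. This is a rough sketch using only Python's standard library, not the only way to do it. The function name copy_rename_extract and the "-extracted" folder suffix are my own placeholders. Note that it overwrites an existing .zip with the same base name.

```python
import shutil
import zipfile
from pathlib import Path


def copy_rename_extract(office_file):
    """Copy the document to a .zip file next to it, then extract that copy."""
    source = Path(office_file)

    # Steps 1 and 2: make a copy whose extension is .zip (original is untouched).
    zip_copy = source.with_suffix(".zip")
    shutil.copy(source, zip_copy)

    # Step 3: extract the copy into a folder named after the document.
    out_dir = source.with_name(source.stem + "-extracted")
    with zipfile.ZipFile(zip_copy) as archive:
        archive.extractall(out_dir)

    # Step 4 is up to you: go look at what is inside out_dir.
    return out_dir
```

Calling copy_rename_extract("test.docx") should leave test.docx as it was, create test.zip beside it, and return the path test-extracted.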

I tested this on Ubuntu 18.04 and it worked great.

Through Twitter, I learned you can also do this from the command line, without the rename:

unzip test.docx -d test-folder

and then go take a look in test-folder.

So, what’s going on here? The format for modern (2007 and later) MS Office documents is Office Open XML, or OOXML. From Wikipedia’s entry:

Office Open XML (also informally known as OOXML) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. The format was initially standardized by Ecma (as ECMA-376), and by the ISO and IEC (as ISO/IEC 29500) in later versions.
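To see the "zipped, XML-based" part for yourself without extracting anything, a small sketch like the following checks that the file really is a zip archive and reports the root element of each XML part inside it. The function name xml_root_tags is hypothetical, and treating only names ending in .xml or .rels as XML parts is an assumption made for this example.

```python
import zipfile
import xml.etree.ElementTree as ET


def xml_root_tags(office_file):
    """Map each XML part in the archive to the tag of its root element."""
    if not zipfile.is_zipfile(office_file):
        raise ValueError(f"{office_file} is not a zip archive")

    tags = {}
    with zipfile.ZipFile(office_file) as archive:
        for name in archive.namelist():
            # Assumption: XML parts are the entries ending in .xml or .rels.
            if name.endswith((".xml", ".rels")):
                tags[name] = ET.fromstring(archive.read(name)).tag
    return tags
```

On a real document the tags come back prefixed with their XML namespace in curly braces, and the archive root should contain a part named [Content_Types].xml.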

There you go.

© Amy Tabb 2018 - 2023. All rights reserved. The contents of this site reflect my personal perspectives and not those of any other entity.