XML-based MS Office documents are really renamed zip files
04 Jun 2021
I had heard this long ago and only recently tried it: modern MS Office files can be renamed with the .zip extension and then extracted. The result is a folder containing all of the original files used to create the document.
This process is particularly great for PowerPoint files because they tend to be made up of lots of images, but it works for Word and other files too.
The process, version 1:
- make a copy of the file,
- rename the extension to .zip,
- extract the .zip with the tool of your choice,
- then take a look at the data!
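The steps above can also be scripted. Here is a minimal Python sketch using only the standard library; the function name `extract_office_file` and the folder layout are my own choices for illustration, not part of any Office tooling.

```python
import shutil
import zipfile
from pathlib import Path


def extract_office_file(source, dest_dir):
    """Copy an Office file to a .zip name, then extract the copy.

    Hypothetical helper: the name and the output layout are illustrative.
    """
    source = Path(source)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Step 1 and 2: work on a copy that carries the .zip extension,
    # so the original document is never touched.
    zip_copy = dest_dir / (source.stem + ".zip")
    shutil.copyfile(source, zip_copy)

    # Step 3: extract the copy into a folder named after the document.
    out_dir = dest_dir / source.stem
    with zipfile.ZipFile(zip_copy) as archive:
        archive.extractall(out_dir)

    # Step 4 is up to you: go look inside out_dir.
    return out_dir
```

For example, `extract_office_file("test.docx", "scratch")` would leave the extracted files in `scratch/test`, assuming a `test.docx` exists in the current directory.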
I tested this on Ubuntu 18.04 and it worked great.
Through Twitter, I learned you can also do this through the command line, without a rename. So,
unzip test.docx -d test-folder
and then go take a look in test-folder.
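If you only want to see what is inside without extracting anything, Python's standard `zipfile` module can read the document directly, again with no rename. The sketch below lists the embedded images. It assumes they sit under a `media/` folder inside the archive, which is what I would expect for PowerPoint and Word files but is worth checking on your own documents; `list_media` is a made-up name.

```python
import zipfile


def list_media(path):
    """Return the names of archive members stored under a media/ folder.

    Assumption: images are kept in a folder called media/ inside the archive.
    """
    with zipfile.ZipFile(path) as archive:
        # Member names inside a ZIP use forward slashes on every OS.
        return sorted(name for name in archive.namelist() if "/media/" in name)
```

Calling `list_media("test.pptx")` on a presentation (the file name here is just a placeholder) should return the image paths, and `archive.namelist()` on its own gives the full listing.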
So, what’s going on here? The format for modern (2007 and later) MS Office documents is Office Open XML, or OOXML. From Wikipedia’s entry:
Office Open XML (also informally known as OOXML)[3] is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. The format was initially standardized by Ecma (as ECMA-376), and by the ISO and IEC (as ISO/IEC 29500) in later versions.
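Since the format is a zipped package, you can check whether a given file looks like OOXML before trying to open it. This is a rough heuristic sketch, not a validator: it tests that the file is a ZIP archive and that it contains a `[Content_Types].xml` entry, which OOXML packages carry at the top level. The function name `looks_like_ooxml` is hypothetical.

```python
import zipfile


def looks_like_ooxml(path):
    """Heuristic: True if path is a ZIP archive containing [Content_Types].xml."""
    # Old binary formats such as .doc or .ppt are not ZIP archives.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "[Content_Types].xml" in archive.namelist()
```

An ordinary .zip file without that entry, or a pre-2007 binary document, would come back `False`.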
There you go.