Advantages of DOCX Format Over DOC

Today I accidentally found out that a .docx file is really a .zip file (or there is no big difference between them). When you rename a .docx to .zip and open it with WinRAR, you see a bunch of XML files in folders. Those XML files store the text, fonts, owner, last-modified date and so on. In short, all the information is stored as XML data.

But the same is not true for .doc files. It is impossible to open them as .zip or as .rar.

So my question: what advantage of storing .docx data as XML made Microsoft change the way it stores documents? To be precise, I am not asking about the advantages of XML as such, but why Microsoft uses multiple XML files inside an archive to store the .docx data. It turns out that .docx is not a new format at its root.


A .docx file can store embedded resources, like image files, not just XML files. Instead of encoding stuff in base64 or something and storing it within an XML file or inventing yet another binary serialization format, they decided to go with the standard ZIP format.
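
To make that concrete, here is a minimal Python sketch using only the standard `zipfile` module. The archive it builds is a hypothetical stand-in laid out like a .docx (a few XML parts plus one binary image part), not a document Word would accept, and the helper names `build_sample_package`, `list_parts` and `read_part` are made up for illustration. The same two reader functions should work on the bytes of a real .docx as well.

```python
import io
import zipfile


def build_sample_package():
    """Build a tiny ZIP archive laid out like a .docx (illustrative stand-in only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document><body>Hello</body></document>")
        # A binary resource goes in as raw bytes; no base64 encoding is needed.
        zf.writestr("word/media/image1.png", b"\x89PNG\r\n\x1a\n" + bytes(16))
    return buf.getvalue()


def list_parts(package):
    """Return the names of all entries stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.namelist()


def read_part(package, name):
    """Return the raw bytes of a single entry."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read(name)


package = build_sample_package()
for part in list_parts(package):
    print(part)
```

For a real file you would pass `open("some.docx", "rb").read()` instead of the sample package (the file name here is just a placeholder).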

Besides that, XML is a very verbose format containing lots of redundant patterns, such as the same tag names repeated over and over, so XML files usually compress very well.
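
A rough way to see this for yourself is to run some tag-heavy XML through DEFLATE, the compression method ZIP normally uses. The sketch below uses Python's `zlib`. The sample markup is invented for illustration and is more repetitive than a real document, so treat the resulting number as an optimistic illustration rather than a measurement of actual .docx files.

```python
import zlib


def compression_ratio(data):
    """Return original size divided by DEFLATE-compressed size."""
    compressed = zlib.compress(data, 9)
    return len(data) / len(compressed)


# Invented sample: 200 paragraphs with identical markup and slightly varying text.
sample_xml = "".join(
    "<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Paragraph number %d</w:t></w:r></w:p>" % i
    for i in range(200)
).encode("utf-8")

ratio = compression_ratio(sample_xml)
print("original bytes:", len(sample_xml))
print("compression ratio: %.1f" % ratio)
```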

By the way, I don't really get the "tricking us" part. Is it better to invent a new cryptic file format from scratch or use a standard, known format?


The Wikipedia article sums it up pretty nicely:

"Microsoft came under increasing pressure to adopt an open file format, in particular several nations adopted rules that official documents should be in an open format."

Edit: And zipping it up makes a lot of sense, as the XML is very verbose, and naturally compresses really well.


Using a renamed .zip file is a pretty common practice. For example, Quake III .pk3 files are really .zip files. There's no point inventing your own compressed file format when perfectly good ones exist already.