How can doc/docx files be converted to markdown or structured text?
Solution 1:
Pandoc supports conversion from docx to markdown directly:
pandoc -f docx -t markdown foo.docx -o foo.markdown
Several markdown formats are supported:
-t gfm (GitHub-Flavored Markdown)
-t markdown_mmd (MultiMarkdown)
-t markdown (pandoc’s extended Markdown)
-t markdown_strict (original unextended Markdown)
-t markdown_phpextra (PHP Markdown Extra)
-t commonmark (CommonMark Markdown)
Solution 2:
Options
- Use a Conversion Tool for multi-file conversion.
- Use a WYSIWYG Editor for single files and superior fonts.
Which Conversion Tools?
I've tested these three: (1) Pandoc (2) Mammoth (3) w2m
Pandoc
By far the superior tool for conversions, with support for a multitude of file types (see Pandoc's man page for the supported file types):
pandoc -f docx -t gfm somedoc.docx -o somedoc.md
NB
- To get pandoc to export markdown tables ('pipe_tables' in pandoc), use the markdown_mmd (MultiMarkdown) or gfm output format.
- If formatting to PDF, pandoc uses LaTeX templates for this, so you may need to install the LaTeX package for your OS if that command does not work out of the box. Instructions at LaTeX Installation
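For the multi-file case mentioned under Options, the single pandoc command can be wrapped in a loop. The following is a minimal sketch, not a tested production script: md_name and convert_dir are hypothetical helper names made up for illustration, and it assumes pandoc is on the PATH and that every input file ends in .docx.

```shell
# Sketch: convert every .docx file in a directory to GitHub-Flavored Markdown.
# md_name and convert_dir are hypothetical helper names, not part of pandoc.

# Map an input path such as docs/report.docx to docs/report.md.
md_name() {
  printf '%s\n' "${1%.docx}.md"
}

# Run pandoc once per .docx file found directly inside the given directory
# (defaults to the current directory). The body runs in a subshell so the
# loop variable does not leak into the caller's shell.
convert_dir() (
  for f in "${1:-.}"/*.docx; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    pandoc -f docx -t gfm "$f" -o "$(md_name "$f")"
  done
)
```

After sourcing the functions, something like convert_dir ./docs would write one .md file next to each .docx file. Swap gfm for another output format from the list above if a different markdown flavor is wanted.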
Which WYSIWYG Editors?
Writage
In answer to this specific question (docx --> markdown), use the Writage plugin for Microsoft Word. It also works the other way round (markdown --> docx).
Maintain Superior Fonts
If you wish to preserve Unicode characters and emojis and maintain superior fonts, you'll get some mileage from the editors below when using copy-and-paste operations between file formats. Note that these do not read or write docx natively.
- Typora
- iA Writer
- Markdown Viewer for Chrome.
Programmatic Equivalent
For a programmatic equivalent, you might get some results by calling a different pdf-engine with its respective options, but I haven't tested this. Pandoc defaults to 'pdflatex'.
pandoc --pdf-engine=
pandoc --pdf-engine-opt=STRING
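As a rough illustration of the idea above, the sketch below picks the engine explicitly and checks that it is installed before calling pandoc. It is untested against real documents: have_engine and md_to_pdf are hypothetical helper names, and the choice of xelatex as the default alternative to pdflatex is an assumption (it is one of the engines pandoc accepts and it handles Unicode text and system fonts).

```shell
# Sketch: build a PDF with an explicitly chosen engine, checking first that
# the engine is on PATH. have_engine and md_to_pdf are hypothetical helpers.

# Succeeds only if the named program can be found on PATH.
have_engine() {
  command -v "$1" >/dev/null 2>&1
}

# Usage: md_to_pdf infile.md outfile.pdf [engine]
md_to_pdf() {
  # Assumption: fall back to xelatex when no engine is given.
  set -- "$1" "$2" "${3:-xelatex}"
  if ! have_engine "$3"; then
    echo "PDF engine '$3' not found; a LaTeX distribution may need installing" >&2
    return 1
  fi
  pandoc "$1" --pdf-engine="$3" -o "$2"
}
```

For example, md_to_pdf infile.md outfile.pdf lualatex would try lualatex instead. Engine-specific switches can still be passed through pandoc's --pdf-engine-opt option.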
Update: A4 vs US Letter
Outside the US, set the geometry variable to get A4 paper:
pandoc -s -V geometry:a4paper -o outfile.pdf infile.md
Footnote
It's worth mentioning something that is not obvious when first discovering Markdown: MultiMarkdown is by far the most feature-rich markdown format.
MultiMarkdown supports, among other things, metadata, tables of contents, footnotes, maths, tables and YAML.
But GitHub's default format is gfm, which also supports tables. I use gfm for GitHub/GitLab and MultiMarkdown for everything else.
Solution 3:
Given that you asked this question on Stack Overflow, you probably want a programmatic or command-line solution, for which I've included another answer.
However, an alternative solution might be to use the Writage Markdown plugin for Microsoft Word.
Writage turns Word into your Markdown WYSIWYG editor, so you will be able to open a Markdown file and edit it like you normally edit any document in Microsoft Word. Also it will be possible to save your Word document as a Markdown file without any other converters.
Under the covers, Writage uses Pandoc, which you'll also need to install for this plugin to work.
It currently supports the following Markdown elements:
- Headings
- Lists (numbered and bulleted)
- Links
- Font styles such as bold, italic
- Tables
- Footnotes
This might be the ideal solution for many end users, as they won't need to install or run any command-line tools and can just stick with what they are most familiar with.