How can doc/docx files be converted to markdown or structured text?

Solution 1:

Pandoc supports conversion from docx to markdown directly:

pandoc -f docx -t markdown foo.docx -o foo.markdown

Several markdown formats are supported:

-t gfm (GitHub-Flavored Markdown)  
-t markdown_mmd (MultiMarkdown)  
-t markdown (pandoc’s extended Markdown)  
-t markdown_strict (original unextended Markdown)  
-t markdown_phpextra (PHP Markdown Extra)  
-t commonmark (CommonMark Markdown)  
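
To compare how the flavours listed above render the same document, a small shell loop can write one output file per format. This is only a sketch and not part of the original answer: the helper name `docx_to_flavours` and the output naming scheme (`foo.gfm.md` and so on) are assumptions made for illustration. Only the `-f`, `-t` and `-o` options come from the command shown earlier.

```shell
# Sketch: convert one .docx into several Markdown flavours for comparison.
# docx_to_flavours is a hypothetical helper name; the output naming is arbitrary.
docx_to_flavours() {
  local input="$1"
  shift
  local base="${input%.docx}"
  local fmt
  for fmt in "$@"; do
    # e.g. foo.docx + gfm -> foo.gfm.md
    pandoc -f docx -t "$fmt" "$input" -o "${base}.${fmt}.md" || return 1
  done
}
```

It might be called as `docx_to_flavours foo.docx gfm commonmark markdown_strict`, after which the resulting files can be diffed to see which flavour preserves the most structure.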

Solution 2:

Options

  1. Use a Conversion Tool for multi-file conversion.
  2. Use a WYSIWYG Editor for single files and superior fonts.

Which Conversion Tools?

I've tested these three: (1) Pandoc (2) Mammoth (3) w2m


Pandoc

By far the superior tool for conversions with support for a multitude of file types (see Pandoc's man page for supported file types):

pandoc -f docx -t gfm somedoc.docx -o somedoc.md
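
Since the conversion-tool route is the one suggested for multi-file work, the single command above can be wrapped in a loop. A minimal sketch under stated assumptions: `convert_dir` is a hypothetical helper name (it is not something pandoc provides), and writing each `.md` file next to its source `.docx` is just one possible layout.

```shell
# Sketch: convert every .docx in a directory to GitHub-Flavored Markdown.
# convert_dir is a hypothetical helper; it is not part of pandoc.
convert_dir() {
  local dir="${1:-.}"
  local f
  for f in "$dir"/*.docx; do
    # Skip the unexpanded pattern when the directory has no .docx files.
    [ -e "$f" ] || continue
    # Report a failed file on stderr and carry on with the rest.
    pandoc -f docx -t gfm "$f" -o "${f%.docx}.md" || echo "failed: $f" >&2
  done
}
```

Running `convert_dir ./docs` would then attempt each file in turn, and a failure on one document does not stop the others from being converted.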

NB
  • To get pandoc to export markdown tables ('pipe_tables' in pandoc) use multimarkdown or gfm output formats.

  • When converting to PDF, pandoc goes through LaTeX by default, so you may need to install a LaTeX distribution for your OS if the command does not work out of the box. See LaTeX Installation for instructions.


Which WYSIWYG Editors?

Writage

In answer to this specific question (docx --> markdown), use the Writage plugin for Microsoft Word. It also works the other way round, markdown --> docx.


Maintain Superior Fonts

If you wish to preserve Unicode characters and emojis and maintain superior fonts, you'll get some mileage from the editors below when copying and pasting between file formats. Note that these do not read or write docx natively.

  • Typora
  • iA Writer
  • Markdown Viewer for Chrome

Programmatic Equivalent

For a programmatic equivalent, you might get some results by calling a different pdf-engine with its respective options, but I haven't tested this. Pandoc defaults to 'pdflatex'.

pandoc --pdf-engine=
pandoc --pdf-engine-opt=STRING
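
As an equally untested illustration of the two options above, the sketch below renders Markdown to PDF with an explicitly chosen engine. The assumptions: `md_to_pdf` is a hypothetical helper name, `xelatex` is only an example default (pandoc's own default remains pdflatex, and whichever engine you pick must be installed separately), and the third argument is passed through verbatim as a single engine option.

```shell
# Sketch: render Markdown to PDF with an explicitly chosen engine.
# md_to_pdf is a hypothetical helper; xelatex is an example default, not pandoc's.
md_to_pdf() {
  local input="$1"
  local engine="${2:-xelatex}"
  local opt="${3:-}"
  if [ -n "$opt" ]; then
    # Forward one extra command-line argument to the engine.
    pandoc "$input" --pdf-engine="$engine" --pdf-engine-opt="$opt" -o "${input%.md}.pdf"
  else
    pandoc "$input" --pdf-engine="$engine" -o "${input%.md}.pdf"
  fi
}
```

For example, `md_to_pdf infile.md lualatex` would write `infile.pdf` using lualatex, assuming that engine is on your PATH.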

Update: A4 vs US Letter

Outside the US, where A4 rather than US Letter is the usual paper size, set the geometry variable:

pandoc -s -V geometry:a4paper -o outfile.pdf infile.md

Footnote

It's worth mentioning here something that is not obvious when first discovering Markdown: MultiMarkdown is by far the most feature-rich Markdown format.

MultiMarkdown supports, amongst other things, metadata, table of contents, footnotes, maths, tables and YAML.

GitHub's default format, however, is gfm, which also supports tables. I use gfm for GitHub/GitLab and MultiMarkdown for everything else.

Solution 3:

Given that you asked this question on Stack Overflow, you probably want a programmatic or command-line solution, for which I've included another answer.

However, an alternative solution might be to use the Writage Markdown plugin for Microsoft Word.

Writage turns Word into your Markdown WYSIWYG editor, so you will be able to open a Markdown file and edit it like you normally edit any document in Microsoft Word. Also it will be possible to save your Word document as a Markdown file without any other converters.

Under the covers, Writage uses Pandoc, which you'll also need to install for this plugin to work.

It currently supports the following Markdown elements:

  • Headings
  • Lists (numbered and bulleted)
  • Links
  • Font styles such as bold, italic
  • Tables
  • Footnotes

This might be the ideal solution for many end users, as they won't need to run any command-line tools and can instead stick with what they are most familiar with.