Convert docx to PDF

This answer passes every test in your test document except the flow chart one.

sudo apt-get install unoconv
doc2pdf respondus-docx-sample-file.docx

Why is this better than the other methods suggested thus far?

I have tested the other methods suggested so far (especially oowriter and ebook-convert), but they pass fewer tests than this method. The ebook-convert method strips the margins and part of the text out of the document.

This method even yields better results than a professional converter such as rainbowpdf.

I also tried converting it to html, but the drawing with the square in the circle and the flow chart are incorrect.

Why does the flow chart test fail?

It seems that libreoffice and unoconv have trouble rendering the flow chart in the .docx file correctly, probably because it was made using smart art in Microsoft Office. This is a bug that is also discussed on this thread. The textual and visual information is present in the pdf produced by the above method, as you can see (I had to select the text, though).

The flow chart does not display quite as expected.

The font color, for instance, is not read properly and some lines are too long. I am not aware of any Linux solution that is able to display smart art correctly. :(

This is also the reason why none of the print solutions posted on this page will satisfy you.

In short

What you are doing is really hard, and at present there is no solution that will fully satisfy you. The Achilles' heel of docx2pdf conversions is the smart art. If you can live without it, or if you can find a way to spot smart art and somehow convert it into an image, you can reach your goal.
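Spotting smart art could be sketched roughly as follows. This is only an illustration: it relies on a .docx file being a zip archive and assumes that Word keeps its SmartArt parts under word/diagrams/ (drawing1.xml is one of them). The helper name is made up.

```python
import zipfile

# Assumption: Word stores SmartArt parts under this prefix inside the archive.
SMARTART_PREFIX = "word/diagrams/"


def smartart_parts(docx):
    """Return the archive members that look like SmartArt parts.

    `docx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.startswith(SMARTART_PREFIX)
        )
```

An empty list suggests the file contains no smart art and can go straight to the converter; a non-empty list flags it for special handling.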

Option 1. Force your users to deal with the problem

This is a very inelegant solution. Your content creators could save their smart art as jpg, as described in the Office help pages, so that the conversion becomes possible on your server.

Option 2. Hack your way around the problem

If the flow charts are often very similar, and depending on how good a developer you are, you could try to convert the smart art separately. You could extract the drawing1.xml file from the .docx file (which is really a zip archive of documents) and then use natural language processing and some crazy hacks to rebuild the smart art. For instance, you'd have to mess with this type of xml:

<dsp:txBody>
  <a:bodyPr spcFirstLastPara="0" vert="horz" wrap="square" lIns="8255" tIns="8255" rIns="8255" bIns="8255" numCol="1" spcCol="1270" anchor="ctr" anchorCtr="0">
    <a:noAutofit/>
  </a:bodyPr>
  <a:lstStyle/>
  <a:p>
    <a:pPr lvl="0" algn="ctr" defTabSz="577850">
      <a:lnSpc>
        <a:spcPct val="90000"/>
      </a:lnSpc>
      <a:spcBef>
        <a:spcPct val="0"/>
      </a:spcBef>
      <a:spcAft>
        <a:spcPct val="35000"/>
      </a:spcAft>
    </a:pPr>
    <a:r>
      <a:rPr lang="en-US" sz="1300" b="1" kern="1200"/>
      <a:t>All three sides are different lengths</a:t>
    </a:r>
  </a:p>
</dsp:txBody>

Or, as a minimal solution, you could at least extract the text (<a:t>?) from the file and save it in a simpler form. Or, if the flow charts in your documents are all the same, you could write a script to change the text color and the line length in the xml itself. Then you could run doc2pdf and you'd have a file that essentially has all the right info, but maybe not the formatting. In the case of flow charts you'd probably also want to include some of the formatting, because the formatting is part of the info.
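The minimal "extract the text" idea might look something like the sketch below. It assumes the smart art lives in word/diagrams/drawing*.xml inside the .docx archive and that the text runs are <a:t> elements in the DrawingML namespace; the function name is hypothetical and this does not attempt to rebuild any formatting.

```python
import zipfile
import xml.etree.ElementTree as ET

# DrawingML main namespace, i.e. what the "a:" prefix stands for.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def smartart_text(docx):
    """Collect the text of every <a:t> run in the SmartArt drawing parts.

    `docx` may be a path or a binary file-like object.
    """
    texts = []
    with zipfile.ZipFile(docx) as archive:
        for name in sorted(archive.namelist()):
            # Assumption: drawings are stored as word/diagrams/drawingN.xml.
            if not (name.startswith("word/diagrams/drawing")
                    and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            for run in root.iter(f"{{{A_NS}}}t"):
                if run.text and run.text.strip():
                    texts.append(run.text.strip())
    return texts
```

For the fragment shown above, this should return the single string "All three sides are different lengths", which you could then store alongside the converted pdf.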

Option 3. Use a third party service

Over the past few days I have done some more research and found a service that does the conversion perfectly: zamzar. Zamzar allows you to upload a docx file and then emails you a link. They also have a (paid?) service where you can send any file to [email protected] and then get the converted file back in your inbox. You could easily build a system around this where you automatically send the file and parse the result from the email. This is not much work and the end result is the best.
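The "parse it from the email" step could look roughly like this. It is only a sketch: it assumes the converted file comes back as an application/pdf attachment (I have not confirmed that this is how the service replies), the function name is made up, and fetching the raw message from the mailbox is left out.

```python
from email import message_from_bytes, policy


def pdf_attachments(raw_message):
    """Return (filename, bytes) pairs for each PDF attached to a raw email."""
    # policy.default gives us the modern EmailMessage API.
    message = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in message.iter_attachments():
        # Assumption: the converted document is sent as application/pdf.
        if part.get_content_type() == "application/pdf":
            found.append((part.get_filename(), part.get_content()))
    return found
```

Each returned pair can be written to disk as is; a reply that only contains a download link would yield an empty list and would need a different approach.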

Notes

  • If anyone has other services that do the same, please feel free to edit them in.
  • I have mailed the zamzar support to ask whether they have an api. That would be even easier.
  • Maybe Aspose for .NET and Java could also help out? Or docx4java as in this very related SO post.
  • Another option is to look into the odf-converter, which seems dated and depends on openoffice rather than libreoffice.
  • I can now confirm that the java jodconverter also fails the flow chart conversion.

I have actually taken the time to test the different methods proposed on this page. Please back any comments up with actual tests.


This is a command-line solution that works decently --- but uses proprietary software.

I think the basic problem is that Microsoft Word formats are fully understood only by Microsoft Word (and even there, there are differences between versions: some old Word files open incorrectly formatted in newer versions). All the other solutions are approximations and hacks, so whether they work depends on the file.

So, to be sure, you need to process your .docx files with a Microsoft Word installation (and yes, I think that is their choice and it's fair. If you do not want to use Word, don't use it. I go with LaTeX for my work, but it's difficult to convince the rest of the world...).

I have been using Crossover for ages to run Microsoft Office on my Linux desktop (1), and I find it quite useful. Maybe it works with wine too; I have never tried.

I do the conversion using this configuration:

1) I have Crossover installed

2) I have my version of Microsoft Office installed under Crossover

3) In Microsoft Word, disable "background printing"

4) I have cups-pdf printer installed and selected as default printer.

5) To do the conversion, run (hints here):

~/cxoffice/bin/wine --cx-app winword.exe respondus-docx-sample-file.docx /q /n /mFilePrintDefault /mFileExit

6) Your converted file will appear in the ~/PDF/ directory.

Your document comes out almost perfectly (there is some misalignment on answer #2, which also shows in my Office Word 2007 when running under Crossover; I do not know if it's related to my Windows version).

pages 1-2

pages 3-4

Now, the problem is that the graphical Word interface will pop up, and I do not know how to make it "headless". The command line options for Word didn't help...

(1) I am in no way related to CodeWeavers, just a happy user.


If you have LibreOffice installed, you can try converting with that. Just press Ctrl+Alt+T on your keyboard to open Terminal. When it opens, run the command below:

libreoffice --headless --convert-to pdf <file_name>.docx --outdir output/path/for/pdf
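If you have a whole folder of files, a small wrapper along these lines might save some typing. It is a sketch, not a tested tool: it assumes the libreoffice binary is on your PATH, uses the double-dash spellings --convert-to and --outdir, and the function names are my own.

```python
import subprocess
from pathlib import Path


def libreoffice_cmd(docx, outdir, binary="libreoffice"):
    """Build the argument list for one headless docx-to-pdf conversion."""
    return [
        binary, "--headless",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        str(docx),
    ]


def convert_all(src_dir, outdir):
    """Convert every .docx in src_dir, stopping at the first failure."""
    for docx in sorted(Path(src_dir).glob("*.docx")):
        # check=True raises CalledProcessError if LibreOffice exits non-zero.
        subprocess.run(libreoffice_cmd(docx, outdir), check=True)
```

Calling convert_all("docs", "output/path/for/pdf") would then run one conversion per file; passing a list instead of a shell string avoids quoting problems with spaces in file names.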

Another option is to install Cups PDF.

To do so just press Ctrl+Alt+T on your keyboard to open Terminal. When it opens, run the command(s) below:

sudo apt-get install cups-pdf

Then create a new printer, set it as a PDF file printer, and give it any name you like, as long as you remember it. Then run the following, replacing pdf with the printer name you chose:

oowriter -pt pdf your_word_file.docx

And your PDF file will be in ~/PDF.