Convert Word doc to HTML programmatically in Java
I need to convert a Word document into one or more HTML files in Java. The function takes a Word document as input and produces one HTML file per page: if the Word document has 3 pages, 3 HTML files are generated, split at the page breaks.
I searched for open-source/non-commercial APIs that can convert DOC to HTML, but found nothing. If anybody has done this kind of job before, please help.
Thanks
I recommend JODConverter. It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today.
JODConverter has a lot of documentation, scripts, and tutorials to help you out.
I've used the following approach successfully in production systems where the new MS Word XML format isn't available:
Spawn a process that does something similar to:
http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html
You'd probably want to start OpenOffice once when your program starts, then call the Python script as many times as you need during the run (with some sort of check to ensure the ooffice process is still there).
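One way to do the "is ooffice still there" check from Java is to probe the port the office process listens on. This is only a minimal sketch: it assumes OpenOffice was started so that it accepts socket connections on a known port (8100 is a common choice, but that is an assumption here), and `OfficeProbe` and `isListening` are made-up names for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class OfficeProbe {

    /**
     * Returns true if something accepts a TCP connection on host:port
     * within timeoutMillis, false otherwise.
     * This only shows that the port is open, not that the listener is
     * really OpenOffice or that it is healthy.
     */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }
}
```

Before each conversion you could call something like `OfficeProbe.isListening("127.0.0.1", 8100, 1000)` and restart the office process if it returns false.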
The other option is to spawn the following sort of command every time you need to do the conversion:
ooffice -headless "macro://<path to ooffice vb macro to convert, with parameter pointing to file>"
I've used the macro approach multiple times and it works well (sorry, I don't have the macro code available).
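For the spawn-per-conversion variant, the command can be launched from Java with `ProcessBuilder`. This is a rough sketch rather than the code I actually ran: it assumes `ooffice` is on the PATH, `MacroConverter`, `buildCommand` and `run` are hypothetical names, and the macro URL stays a parameter because it depends entirely on your own macro.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class MacroConverter {

    /**
     * Builds the argument list for: ooffice -headless "macro://..."
     * ProcessBuilder passes each element as a separate argument, so the
     * macro URL needs no shell quoting here.
     */
    static List<String> buildCommand(String officeBinary, String macroUrl) {
        return Arrays.asList(officeBinary, "-headless", macroUrl);
    }

    /**
     * Runs the command, waits up to timeoutSeconds, and returns the exit code.
     * A hung office process is killed and reported as an IOException.
     */
    static int run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)                         // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.INHERIT)   // show output on our console
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("Conversion timed out: " + command);
        }
        return process.exitValue();
    }
}
```

The timeout matters in practice: a headless office process that hangs on a bad document would otherwise block the calling thread indefinitely.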
While there are mechanisms for doing it via MS Word, they're not easy to use from Java, and they require other support programs to drive MS Word via OLE.
I've used AbiWord before too. It works well for many documents but gets confused by more complex ones (ooffice seems to handle everything I've thrown at it). AbiWord has a slightly easier command-line interface for conversion than ooffice.