How can I convert a docx document to html using php?

I want to be able to upload an MS word document and export it a page in my site.

Is there any way to accomplish this?


Solution 1:

//FUNCTION :: read a docx file and return the string
function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            $xml = DOMDocument::loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags

            $contents = explode('\n',strip_tags($xml->saveXML()));
            $text = '';
            foreach($contents as $i=>$content) {
                $text .= $contents[$i];
            }
            return $text;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}

ZipArchive and DOMDocument are both inside PHP so you don't need to install/include/require additional libraries.

Solution 2:

One may use PHPDocX.

It has support for practically all HTML CSS styles. Moreover you may use templates to add extra formatting to your HTML via the replaceTemplateVariableByHTML.

The HTML methods of PHPDocX also allow for the direct use of Word styles. You may use something like this:

$docx->embedHTML($myHTML, array('tableStyle' => 'MediumGrid3-accent5PHPDOCX'));

If you want that all your tables use the MediumGrid3-accent5 Word style. The embedHTML method as well as its version for templates (replaceTemplateVariableByHTML) preserve inheritance, meaning by that that you may use a predefined Word style and override with CSS any of its properties.

You may also extract selected parts of your HTML using 'JQuery type' selectors.

Solution 3:

You can convert Word docx documents to html using Print2flash library. Here is an PHP excerpt from my client's site which converts a document to html:

include("const.php");
$p2fServ = new COM("Print2Flash4.Server2");
$p2fServ->DefaultProfile->DocumentType=HTML5;
$p2fServ->ConvertFile($wordfile,$htmlFile);

It converts a document which path is specified in $wordfile variable to a html page file specified by $htmlFile variable. All formatting, hyperlinks and charts are retained. You can get the required const.php file altogether with a fuller sample from Print2flash SDK.

Solution 4:

this is a workaround based on David Lin's answer above removing "w:" in a docx's xml tags leave behing Html like tags

    function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            $xml = new DOMDocument("1.0", "utf-8");
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING|LIBXML_PARSEHUGE);
            $xml->encoding = "utf-8";
            // Return data without XML formatting tags
            $output =  $xml->saveXML();
            $output = str_replace("w:","",$output);

            return $output;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}