Extract text from doc and docx
I would like to know how can I read the contents of a doc or docx. I'm using a Linux VPS and PHP, but if there is a simpler solution using other language, please let me know, as long as it works under a linux webserver.
This is a .DOCX solution only. For .DOC or .PDF you'll need to use something else like pdf2text.php for PDF
function docx2text($filename) {
return readZippedXML($filename, "word/document.xml");
}
function readZippedXML($archiveFile, $dataFile) {
// Create new ZIP archive
$zip = new ZipArchive;
// Open received archive file
if (true === $zip->open($archiveFile)) {
// If done, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false) {
// If found, read it to the string
$data = $zip->getFromIndex($index);
// Close archive file
$zip->close();
// Load XML from a string
// Skip errors and warnings
$xml = new DOMDocument();
$xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
// Return data without XML formatting tags
return strip_tags($xml->saveXML());
}
$zip->close();
}
// In case of failure return empty string
return "";
}
echo docx2text("test.docx"); // Save this contents to file
Here i have added the solution to get the text from .doc,.docx word files
How to extract text from word file .doc,docx php
For .doc
private function read_doc() {
$fileHandle = fopen($this->filename, "r");
$line = @fread($fileHandle, filesize($this->filename));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
return $outtext;
}
For .docx
private function read_docx(){
$striped_content = '';
$content = '';
$zip = zip_open($this->filename);
if (!$zip || is_numeric($zip)) return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
if (zip_entry_name($zip_entry) != "word/document.xml") continue;
$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}// end while
zip_close($zip);
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);
return $striped_content;
}
Parse .docx, .odt, .doc and .rtf documents
I wrote a library that parses the docx, odt and rtf documents based on answers here and elsewhere.
The major improvement I have made to the .docx and .odt parsing is the that the library processes the XML that describes the document and attempts to conform it to HTML tags, i.e. em and strong tags. This means that if you're using the library for a CMS, text formatting is not lost
You can get it here