Convert PDF to HTML in PHP?

Solution 1:

1) Download pdftohtml and unpack the .exe file into a folder: http://sourceforge.net/projects/pdftohtml/

2) Create a .php file in that folder and put this code in it (assuming that pdftohtml.exe and the source file sample.pdf are both in the same folder):

<?php
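$source_pdf = "sample.pdf";
$output_folder = "MyFolder";

// Create the output folder if it does not exist yet.
if (!file_exists($output_folder)) { mkdir($output_folder, 0777, true); }

// passthru() prints the tool's output directly and has no useful return value;
// the exit code of the command is stored in $exit_code.
$cmd = "pdftohtml " . escapeshellarg($source_pdf) . " " . escapeshellarg("$output_folder/new_file_name");
passthru($cmd, $exit_code);
var_dump($exit_code);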
?>

3) Open MyFolder and you will see the converted files (how many depends on the number of pages).
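
If the file names can contain spaces or come from user input, it is probably worth escaping them before they reach the shell. A minimal sketch, assuming the executable can be called as "pdftohtml"; the function name build_pdftohtml_command is made up for this example and is not part of the tool:

```php
<?php
// Hypothetical helper: builds the pdftohtml command line with escaped arguments.
// $binary defaults to "pdftohtml", assuming the executable is on the PATH
// or in the current folder.
function build_pdftohtml_command(string $pdf, string $outputBase, string $binary = 'pdftohtml'): string
{
    // escapeshellarg() quotes each argument so spaces and shell
    // metacharacters are passed through literally.
    return escapeshellcmd($binary) . ' ' . escapeshellarg($pdf) . ' ' . escapeshellarg($outputBase);
}

// Prints the command that would be handed to passthru() or exec().
echo build_pdftohtml_command('my report.pdf', 'MyFolder/new_file_name'), PHP_EOL;
```

The returned string can be passed to passthru() exactly like the hand-written command in step 2.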

P.S. I haven't tried them, but there are also many commercial and trial APIs.

Solution 2:

Cross-platform solution using Xpdf:

Download the appropriate package of the Xpdf tools and unpack it into a subdirectory of your script's directory. Let's assume it's called "xpdftools".

Add code like this to your PHP script:

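$pdf_file = 'sample.pdf';
$html_dir = 'htmldir';
// Escape the arguments so paths with spaces or special characters are safe.
$cmd = "xpdftools/bin32/pdftohtml " . escapeshellarg($pdf_file) . " " . escapeshellarg($html_dir);

// $out receives the tool's output lines, $ret its exit code.
exec($cmd, $out, $ret);
echo "Exit code: $ret";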

After the script runs successfully, the htmldir directory will contain the converted HTML files (each page in a separate file).
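
To see what was actually produced, you could list the .html files in the output directory. This is only a sketch: list_html_files is a name I made up, and it does not assume any particular naming scheme for the generated pages, it just collects whatever ends in .html:

```php
<?php
// Hypothetical helper: returns the .html files found in $dir, in natural order
// (so a name ending in 2 sorts before one ending in 10).
function list_html_files(string $dir): array
{
    $files = glob(rtrim($dir, '/\\') . '/*.html');
    if ($files === false) {
        // glob() can return false on error; treat that as "nothing found".
        return [];
    }
    sort($files, SORT_NATURAL);
    return $files;
}

// 'htmldir' is the output directory used in the script above.
foreach (list_html_files('htmldir') as $file) {
    echo $file, PHP_EOL;
}
```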

The Xpdf tools use the following exit codes:

  • 0 - No error.
  • 1 - Error opening a PDF file.
  • 2 - Error opening an output file.
  • 3 - Error related to PDF permissions.
  • 99 - Other error.
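
Since exec() already gives you the exit code in $ret, you could turn it into a readable message. A small sketch using the codes listed above; the function name xpdf_exit_message is hypothetical, and the fallback text for unlisted codes is my own:

```php
<?php
// Hypothetical helper: maps the Xpdf exit codes listed above to messages.
function xpdf_exit_message(int $code): string
{
    $messages = [
        0  => 'No error.',
        1  => 'Error opening a PDF file.',
        2  => 'Error opening an output file.',
        3  => 'Error related to PDF permissions.',
        99 => 'Other error.',
    ];
    // Codes that are not in the list get a generic fallback message.
    return $messages[$code] ?? "Unknown exit code: $code";
}

// Example usage with the $ret value that exec() fills in:
// exec($cmd, $out, $ret);
// if ($ret !== 0) { error_log(xpdf_exit_message($ret)); }
```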

Solution 3:

What you are essentially looking to do is to reflow the PDF file. I'm not sure a tool for this exists, and it is at best very difficult to do.

It would be possible to write some code to do what you need for your specific file, but I believe doing so for the general case would be impossible.

I have written an article here that explains why I believe reflowing PDF is flawed: http://www.planetpdf.com/enterprise/article.asp?ContentID=PDF_Reflow_in_Microsoft_Word_2012_Is_it_any_good

Of particular interest is the paragraph beginning "Let's use a newspaper story to illustrate the problem."

You may want to look into what IDRsolutions (which, for transparency, is where I work!) has to offer.

We are currently in the process of putting our PDF to HTML5 and PDF Conversion software in the cloud: http://www.idrsolutions.com/cloud-pdf-converter/

What may be a better fit for you is the PDF text extraction and PDF image extraction functionality of JPedal. It's quite likely we will look at putting this in the cloud also, if the PDF to HTML5 goes well.

Text Extraction: http://www.idrsolutions.com/pdf-to-text-conversion/

Image Extraction: http://www.idrsolutions.com/extract-images-from-pdf/