Converting DJVU to PDF

I want to convert a DJVU document into a PDF document, separating and preserving the text layer and the images while also keeping the structure from the DJVU. How can I do this in Ubuntu?

(I will then be using Calibre to convert to ePub/Mobi, so if there were a Calibre plug-in for this entire process that would be perfect for me!)

Note1: Printing from Evince, exporting from DjView, or anything using the ddjvu tool, are not adequate solutions, as they discard the text layer and save only images.

Note2: Using DjVuLibre seems to extract only the text layer; the pictures are not extracted. Similarly, copying the text "manually" loses both the document structure and the pictures.


Method 1

Simply use DJView and export as PDF

  1. Go to Synaptic Package Manager
  2. Install DJview4
  3. Run DJview (Applications - Graphics - DJView4)
  4. Open your .djvu document
  5. Menu - Export As: PDF

Method 2

  1. Open the djvu file in Evince
  2. Select Print ----> Print to File
  3. Change .ps to .pdf and click Print

Method 3

  1. Go to Synaptic Package Manager
  2. Install

    djvulibre-bin libdjvulibre21 okular-extra-backends evince libevdocument3 libevview3

  3. Go to a terminal and run

     sudo apt-get install libtiff-tools

  4. Go to the directory that contains the djvu file, right-click, and choose “Open In Terminal”. A terminal will open.

  5. In that terminal run

    ddjvu -format=tiff file_name.djvu file_name.tiff
    tiff2pdf -j -o file_name.pdf file_name.tiff
    

Method 4

There is also an online DjVu to PDF converter.


Here is one way, which would require some not so common tools:

  1. ocrodjvu
  2. pdfbeads, which has its own requirements that can be found with Google

We can use the djvu2hocr command (from the ocrodjvu package) to extract the hidden text layer from the DjVu file (it doesn't do any OCR or similar, it just extracts the text layer with its geometry), e.g.:

djvu2hocr -p 10 sample.djvu | sed 's/ocrx/ocr/g' > pg10.html

The sed step corrects the class names in the output hOCR (which is just a simple HTML file).
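Before going further, it may be worth confirming that the page really has a hidden text layer, since djvu2hocr has nothing to extract otherwise. A minimal sketch, assuming djvutxt (from djvulibre-bin) is installed; has_text is a hypothetical helper name of my own, not part of any package:

```shell
# has_text: hypothetical helper. It succeeds (exit status 0) when standard
# input contains at least one non-whitespace character, and fails otherwise.
has_text() {
    grep -q '[^[:space:]]'
}
```

It could be used as `djvutxt -page=10 sample.djvu | has_text && echo "page 10 has a text layer"`.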

Now we extract the DjVu page to TIFF format with:

ddjvu -format=tiff -page=10 sample.djvu pg10.tif

so that we end up with these files in our work folder:

sample.djvu
pg10.html
pg10.tif

This is where pdfbeads comes into play, and we simply execute:

pdfbeads -o pg10.pdf

This nifty program then takes care of everything inside the folder (HTML and TIFF files with the same base name) and produces the output PDF file along with some by-products:

sample.djvu
pg10.html
pg10.tif
pg10.jbig2
pg10.pdf
pg10.sym

The resulting pg10.pdf looks identical to the input DjVu page and has the text layer inside.
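The same steps can be looped over every page of a document. This is only a rough sketch under a few assumptions: djvused (from djvulibre-bin) is used to report the page count, the TIFF names are passed to pdfbeads explicitly, and the function names and the zero-padded pgNNNN naming are my own choices (the padding keeps pg0002 ahead of pg0010 when the shell expands the glob). Run it in an otherwise empty work folder.

```shell
# page_stem: hypothetical helper that prints a zero-padded base name for
# page N, so that a name-sorted listing matches the page order.
page_stem() {
    printf 'pg%04d' "$1"
}

# convert_page: write the hOCR text layer and the TIFF image of one page,
# both under the same base name, as pdfbeads expects.
convert_page() {
    src=$1
    page=$2
    stem=$(page_stem "$page")
    djvu2hocr -p "$page" "$src" | sed 's/ocrx/ocr/g' > "$stem.html"
    ddjvu -format=tiff -page="$page" "$src" "$stem.tif"
}

# djvu_to_pdf: convert every page, then let pdfbeads assemble the PDF.
djvu_to_pdf() {
    src=$1
    out=$2
    pages=$(djvused -e n "$src")    # "n" prints the number of pages
    page=1
    while [ "$page" -le "$pages" ]; do
        convert_page "$src" "$page"
        page=$((page + 1))
    done
    pdfbeads -o "$out" pg*.tif
}
```

With the functions loaded into the shell, `djvu_to_pdf sample.djvu sample.pdf` would process the whole file. Expect it to be slow on large documents, since each page is rendered to TIFF separately.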

Comments summary:

The lengthy comments below discuss representing the smaller images on a DjVu page as separate objects. This is not easily possible, because a DjVu page is itself just a single image with an optional text layer and carries no information about smaller images as separate objects. If the DjVu document has color images, they will usually be placed on the background layer; in that case one can use tools like ddjvu (to extract only the background layer) and imagemagick (to auto-crop) to output just the images instead of the whole canvas, but this can't be automated for creating PDF output.
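For a single page, the background-layer idea might look like the sketch below. It assumes ddjvu's -mode=background option and ImageMagick's convert with -trim (on ImageMagick 7 the command may be called magick instead). The function name, the bg-N.tif intermediate name, and the RUN hook are hypothetical. A page with several pictures still comes out as one cropped image, so the result needs checking by eye.

```shell
# extract_background: hypothetical helper. Renders only the background layer
# of one page and trims the uniform border around it.
# Set RUN=echo to print the commands instead of running them.
extract_background() {
    src=$1     # input DjVu file
    page=$2    # page number
    out=$3     # output image file
    ${RUN:-} ddjvu -format=tiff -mode=background -page="$page" "$src" "bg-$page.tif"
    ${RUN:-} convert "bg-$page.tif" -trim +repage "$out"
}
```

For example, `extract_background sample.djvu 10 figure10.png` would leave the cropped background of page 10 in figure10.png.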

Another saner but slower approach is to use regular OCR GUI tools. gscan2pdf (> 1.0) is suggested as a possible candidate on a Linux PC.


There is djvu2pdf, but it relies on Ghostscript, so it might be just another printing option. I still suggest you give it a look, just in case it's more clever than I'm giving it credit for.

It's not in the repos but you can download a deb from the makers' site: http://0x2a.at/s/projects/djvu2pdf

** Insert mandatory notice about downloading/installing things from outside the repos here **