Converting DJVU to PDF
I want to convert a DJVU document into a PDF document, separating and preserving the text layer and the images while also keeping the structure from the DJVU. How can I do this in Ubuntu?
(I will then be using Calibre to convert to ePub/Mobi, so if there were a Calibre plug-in for this entire process that would be perfect for me!)
Note1: Printing from Evince, exporting from DjView, or anything using the ddjvu tool are not adequate solutions, as they discard the text layer and save only images.
Note2: Using DjVuLibre seems to extract only the text layer; pictures are not extracted. Similarly, copying the text "manually" loses both the document structure and the pictures.
Method 1
Simply use DJView and export as PDF
- Go to Synaptic Package Manager
- Install DjView4
- Run DjView (Applications - Graphics - DjView4)
- Open your .djvu document
- Menu - Export As: PDF
Method 2
Open the DjVu file in Evince
Select Print, then Print to File
Change .ps to .pdf and click Print
Method 3
- Go to Synaptic Package Manager
- Install
djvulibre-bin libdjvulibre21 okular-extra-backends evince libevdocument3 libevview3
- Go to a terminal and write
sudo apt-get install libtiff-tools
- Go to the directory where the DjVu file is located. Right-click and choose the “Open In Terminal” option. A terminal will open.
- In that terminal write
ddjvu -format=tiff file_name.djvu file_name.tiff
tiff2pdf -j -o file_name.pdf file_name.tiff
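If there are several files to convert, the two commands from this method can be wrapped in a small script. This is only a sketch: the function names and the DRY_RUN switch are made up for illustration, and like the commands above it produces an image-only PDF without the text layer.

```shell
#!/bin/sh
# Sketch: apply the two Method 3 commands to one file or to every .djvu file.
# Set DRY_RUN=1 to print the commands instead of running them.

# Helper (hypothetical name): run a command, or just print it in dry-run mode.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Convert a single DjVu file to PDF via an intermediate multi-page TIFF.
djvu_to_pdf() {
    src=$1
    base=${src%.djvu}
    run ddjvu -format=tiff "$src" "$base.tiff" &&
    run tiff2pdf -j -o "$base.pdf" "$base.tiff"
}

# Convert every .djvu file in the current directory.
djvu_to_pdf_all() {
    for f in *.djvu; do
        [ -e "$f" ] || continue   # no matches: the glob stays literal, skip it
        djvu_to_pdf "$f"
    done
}
```

After sourcing the script, `DRY_RUN=1 djvu_to_pdf_all` shows what would be executed, and `djvu_to_pdf_all` runs it. The intermediate .tiff files are left in place and can be deleted afterwards.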
Method 4
There is also an online DjVu to PDF converter
Here is one way, which would require some not so common tools:
- ocrodjvu
- pdfbeads, which has its own requirements that can be found via Google
We can use the djvu2hocr command (from the ocrodjvu package) to extract the hidden text layer from a DjVu file (it doesn't do any OCR or similar, it just extracts the text layer with its geometry), e.g.:
djvu2hocr -p 10 sample.djvu | sed 's/ocrx/ocr/g' > pg10.html
The sed intervention corrects the class names in the output hOCR (which is just a simple HTML file)
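To see what the substitution does, it can be tried on a single line. The fragment below is a made-up hOCR-like line used only for illustration, not actual djvu2hocr output, and the function name is hypothetical.

```shell
#!/bin/sh
# Rename ocrx* class names to ocr* with the same sed expression as in the pipeline.
fix_hocr_classes() {
    sed 's/ocrx/ocr/g'
}

# Hypothetical hOCR fragment, only to illustrate the substitution.
printf '%s\n' "<span class='ocrx_word' title='bbox 10 20 110 45'>Hello</span>" | fix_hocr_classes
# prints: <span class='ocr_word' title='bbox 10 20 110 45'>Hello</span>
```

Note that the expression replaces every occurrence of "ocrx", so it would also change that string if it happened to appear in the page text itself.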
Now we extract the DjVu page to TIFF format with:
ddjvu -format=tiff -page=10 sample.djvu pg10.tif
so that we end up with these files in our work folder:
sample.djvu
pg10.html
pg10.tif
This is where pdfbeads comes into play, and we simply execute:
pdfbeads -o pg10.pdf
This nifty program then takes care of everything inside the folder (HTML and TIFF files with the same base name) and produces the output PDF file along with some by-products:
sample.djvu
pg10.html
pg10.tif
pg10.jbig2
pg10.pdf
pg10.sym
The resulting PDF looks identical to the input DjVu page and has the text layer inside.
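The same steps can be repeated for a whole document instead of a single page. The following is a rough sketch under a few assumptions: the function names are made up, the page count is read with `djvused -e n`, and it is assumed that pdfbeads picks up the page files in name order, which is why the base names are zero-padded. It should be run in a work folder that holds no other page images.

```shell
#!/bin/sh
# Sketch: repeat the per-page steps for every page, then run pdfbeads once.

# Zero-padded base name so that the files sort in page order (pg0010, pg0011, ...).
page_base() {
    printf 'pg%04d' "$1"
}

# Usage: djvu_to_text_pdf sample.djvu sample.pdf
djvu_to_text_pdf() {
    src=$1
    out=$2
    pages=$(djvused -e n "$src") || return 1   # number of pages in the document
    p=1
    while [ "$p" -le "$pages" ]; do
        base=$(page_base "$p")
        # Text layer with geometry, class names corrected by sed.
        djvu2hocr -p "$p" "$src" | sed 's/ocrx/ocr/g' > "$base.html" || return 1
        # Page image with the same base name as the hOCR file.
        ddjvu -format=tiff -page="$p" "$src" "$base.tif" || return 1
        p=$((p + 1))
    done
    # Assumption: pdfbeads processes the files in the current folder in name order.
    pdfbeads -o "$out"
}
```

For a long book this will take a while and fill the folder with intermediate files, so it is worth trying it on a short document first and checking the page order in the result.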
Comments summary:
Lengthy comments below discuss representing smaller images from a DjVu document page as separate objects. This is not easily possible, because a DjVu document page is itself just a single image with an optional text layer, with no "information" about smaller images as separate objects. If the DjVu document has color images, they will usually be placed on the background layer; in this case the user can take advantage of tools like ddjvu (to extract only the background layer) and imagemagick (to auto-crop) to output just the images instead of the whole canvas, but this can't be automated for creating PDF output
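For a single page, the background-layer idea could look roughly like this. It is a sketch only: the function names, the output file names and the 5% fuzz value are assumptions, and auto-cropping only removes a uniform border, so the result still has to be checked by eye.

```shell
#!/bin/sh
# Sketch: render only the background layer of one page, then auto-crop it.
# Set DRY_RUN=1 to print the commands instead of running them.

# Helper (hypothetical name): run a command, or just print it in dry-run mode.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: extract_page_image sample.djvu 10
extract_page_image() {
    src=$1
    page=$2
    bg="bg$page.tif"
    # ddjvu renders just the background layer when given -mode=background.
    run ddjvu -format=tiff -mode=background -page="$page" "$src" "$bg" &&
    # ImageMagick: trim the uniform border; the fuzz value is a guess to tune.
    run convert "$bg" -fuzz 5% -trim +repage "img$page.png"
}
```

This works best when the page holds one picture on an otherwise plain background; with several pictures on a page the crop will cover all of them together.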
Another saner, but slower, approach is to use regular OCR GUI tools. gscan2pdf (> 1.0) is suggested as a possible candidate for a Linux PC
There is djvu2pdf, but it relies on Ghostscript, so it might be just another printing option. I still suggest you give it a look, just in case it's more clever than I'm giving it credit for.
It's not in the repos but you can download a deb from the makers' site: http://0x2a.at/s/projects/djvu2pdf
** Insert mandatory notice about downloading/installing things from outside the repos here **