Converting djvu to pdf AND preserving table of contents, how is it possible?
update: user3124688 has coded up this process in the script dpsprep.
I don't know of any tools that will do the conversion for you. You certainly should be able to do it, but it might take a little work. I'll outline the basic process. You'll need the open source command line utilities pdftk and djvused (part of DjVuLibre). These are available from your package manager (GNU/Linux) or their websites (Windows, OS X).
- step 1: convert the file text
First, use any tool to convert the DJVU file to a PDF (without bookmarks). Suppose the files are called filename.djvu and filename.pdf.
- step 2: extract DJVU outline
Next, output the DJVU outline data to a file, like this:
djvused "filename.djvu" -e 'print-outline' > bmarks.out
This is a file listing the DJVU document's bookmarks in a serialized tree format. In fact it's just an S-expression (SEXPR), and can be easily parsed. The format is as follows:
file ::= (bookmarks <bookmark>*)
bookmark ::= (name page <bookmark>*)
name ::= "<character>*"
page ::= "#<digit>+"
For example:
(bookmarks
  ("bmark1" "#1")
  ("bmark2" "#5"
    ("bmark2subbmark1" "#6")
    ("bmark2subbmark2" "#7"))
  ("bmark3" "#9" ...))
- step 3: convert DJVU outline to PDF metadata format
Now, we need to convert these bookmarks into the format that pdftk uses for PDF metadata. That format is:
file ::= <entry>*
entry ::= BookmarkBegin
          BookmarkTitle: <title>
          BookmarkLevel: <number>
          BookmarkPageNumber: <number>
title ::= <character>*
So our example would become:
BookmarkBegin
BookmarkTitle: bmark1
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: bmark2
BookmarkLevel: 1
BookmarkPageNumber: 5
BookmarkBegin
BookmarkTitle: bmark2subbmark1
BookmarkLevel: 2
BookmarkPageNumber: 6
BookmarkBegin
BookmarkTitle: bmark2subbmark2
BookmarkLevel: 2
BookmarkPageNumber: 7
BookmarkBegin
BookmarkTitle: bmark3
BookmarkLevel: 1
BookmarkPageNumber: 9
Basically, you just need to write a script to walk the SEXPR tree, keeping track of the level, and output the name, page number and level of each entry it comes to, in the correct format.
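As an illustration of that walk, here is a minimal Python sketch. The function names (parse_sexpr, bookmark_lines, convert_outline) are made up for this example and are not part of djvused or pdftk. It assumes every page target has the "#<digit>+" form given in the grammar above and only undoes simple backslash escapes such as \" and \\, so titles with other escape sequences or outlines that point at named pages would need extra handling.

```python
import re

# Tokens: parentheses, double-quoted strings (with backslash escapes), bare symbols.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


def parse_sexpr(text):
    """Parse one S-expression into nested Python lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing parenthesis
            return items
        if tok.startswith('"'):
            # Strip the quotes and undo simple escapes such as \" and \\.
            return re.sub(r"\\(.)", r"\1", tok[1:-1])
        return tok  # bare symbol, e.g. the leading "bookmarks"

    return read() if tokens else []


def bookmark_lines(node, level):
    """Return the pdftk lines for one bookmark and, recursively, its children."""
    title, page, *children = node
    lines = [
        "BookmarkBegin",
        f"BookmarkTitle: {title}",
        f"BookmarkLevel: {level}",
        # Assumption: the target looks like "#12"; drop the leading "#".
        f"BookmarkPageNumber: {page.lstrip('#')}",
    ]
    for child in children:
        lines.extend(bookmark_lines(child, level + 1))
    return lines


def convert_outline(text):
    """Convert djvused print-outline text to pdftk bookmark metadata text."""
    tree = parse_sexpr(text)
    if not tree:
        return ""  # empty input: nothing to convert
    if tree[0] != "bookmarks":
        raise ValueError("expected a (bookmarks ...) expression")
    lines = []
    for node in tree[1:]:
        lines.extend(bookmark_lines(node, 1))
    return "".join(line + "\n" for line in lines)
```

To use it on the file from step 2, something like convert_outline(open("bmarks.out", encoding="utf-8").read()) should return the text to splice in during step 4.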
- step 4: extract PDF metadata and splice in converted bookmarks
Once you've got the converted list, output the PDF metadata from your converted PDF file:
pdftk "filename.pdf" dump_data > pdfmetadata.out
Now, open the file, find the line that begins with NumberOfPages: and insert the converted bookmarks after this line. Save the new file as pdfmetadata.in.
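If you would rather not edit the file by hand, the splice can be scripted. The following is only a sketch: splice_bookmarks is a hypothetical helper name, and it assumes the dump contains a line starting with NumberOfPages: as described above.

```python
def splice_bookmarks(metadata, bookmarks):
    """Insert bookmark text right after the NumberOfPages line of a pdftk dump."""
    block = bookmarks.rstrip("\n")
    block = block + "\n" if block else ""
    out = []
    inserted = False
    for line in metadata.splitlines(keepends=True):
        out.append(line)
        if not inserted and line.startswith("NumberOfPages:"):
            if not line.endswith("\n"):
                out.append("\n")  # the dump ended without a trailing newline
            out.append(block)
            inserted = True
    if not inserted:
        raise ValueError("no NumberOfPages line found in the metadata")
    return "".join(out)
```

Reading pdfmetadata.out, passing its contents together with the converted bookmarks to splice_bookmarks, and writing the result to pdfmetadata.in gives the input file for the next step.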
- step 5: create PDF with bookmarks
Now we can create a new PDF file incorporating this metadata:
pdftk "filename.pdf" update_info "pdfmetadata.in" output out.pdf
The file out.pdf should be a copy of your PDF with the bookmarks imported from the DJVU file.
Based on the very clear outline above given by user @pyrocrasty (thank you!), I have implemented a DJVU to PDF converter which preserves both OCR'd text and the bookmark structure. You may find it here:
https://github.com/kcroker/dpsprep
Acknowledgements for the OCR data go to @zetah on the Ubuntu forums!