PDF has an extra blank in all words after running through Ghostscript

This PDF was produced by Abbyy Finereader 10:

http://ebooks.zeitr.org/from_abbyy.pdf

You can copy & paste the first sentence and get this (very good) text result:

Der »Bund Deutscher Gymnastik-Schulleiter« wurde am 20. November 1955 anläßlich einer Zusammenkunft der Leiterinnen und Leiter der privaten deutschen Gymnastik-Ausbildungsstätten gegründet.

After some processing with Ghostscript 9.02 (64 bit Windows) I get this file:

http://ebooks.zeitr.org/after_ghostscript.pdf

Now the first sentence looks strange - there is an extra space before the last character of each word.

Der »Bun d Deutsche r GymnastikSchulleiter « wurd e a m 20 . Novembe r 195 5 anläßlic h eine r Zusammenkunf t der Leiterinne n un d Leite r de r private n deutsche n GymnastikAusbildungsstätte n gegründet .

This has the main negative effect that you cannot search for whole words in Acrobat Reader. I can reproduce the effect with the following minimal parameter set for Ghostscript:

-sDEVICE=pdfwrite ^
-dBATCH ^
-dNOPAUSE ^
-sstdout="myStdOut" ^
-sOutputFile="myDestFile.pdf" ^
 mySourceFile.pdf

Any ideas?


I found this an interesting problem and had a closer look...

First, I used the qpdf command-line tool to uncompress the PDF data streams so I could get a better look at the source code of both files:

qpdf.exe ^
   --qdf ^
     from_abbyy.pdf ^
     qdf--from_abbyy.pdf

qpdf.exe ^
   --qdf ^
     after_ghostscript.pdf ^
     qdf--after_ghostscript.pdf

Looking at one of the first occurrences where an extra space gets inserted (it is the original string "Bund Deutscher Gymnastik-Schulleiter" turning into "Bun d Deutsche r GymnastikSchulleiter"), I find the following PDF snippets:

In qdf--from_abbyy.pdf:

( Deutsche) Tj
0 Tc
(r) Tj
1 0 0 1 143.236 265.140 Tm     %% Tm = 'text matrix' operator
3.569 Tw
0.706 Tc
( Gymnastik-Schulleite) Tj

In qdf--after_ghostscript.pdf:

( Deutsche)Tj
0 Tc
36.235 0 Td                    %% extra Td = 'move text current point' operator
(r)Tj
2.16501 0 Td                   %% Td = 'move text current point' instead of Tm
3.569 Tw
0.706 Tc
( Gymnastik-Schulleite)Tj

To give you an idea of what the PDF graphics operators used here mean, here is a short list:

Tj - show text
Tc - set character spacing
Tm - set text matrix
Tw - set word spacing
Td - move text current point

As you can see, Ghostscript replaced the original Tm (text matrix) operator with a Td (move text current point) operator (2.16501 0 Td), and it also added an extra Td (36.235 0 Td) in front of the (r)Tj. I don't know why this is. I'll submit a bug report to Ghostscript's bugzilla [*] and see if they are interested in solving it.
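
To make the difference between the two positioning operators more concrete, here is a minimal Python sketch (my own illustration, not Ghostscript code). It tracks only the origin of the current text line, assumes the text matrix is a pure translation as in the snippets above, and uses a hypothetical line start (START) that I picked so that the two variants can be compared; that value does not come from either file.

```python
def apply_ops(ops, start=(0.0, 0.0)):
    """Return the text line origin after each positioning operator.

    ops is a list of (operator, operands) tuples. Only Tm and Td are
    interpreted, and Tm must be a pure translation (a=d=1, b=c=0).
    """
    x, y = start
    origins = []
    for op, args in ops:
        if op == "Tm":
            a, b, c, d, e, f = args
            if (a, b, c, d) != (1, 0, 0, 1):
                raise ValueError("this sketch handles pure translations only")
            x, y = e, f                # absolute: replaces the current origin
        elif op == "Td":
            tx, ty = args
            x, y = x + tx, y + ty      # relative to the start of the current line
        else:
            continue                   # Tj, Tc, Tw do not move the line origin
        origins.append((op, round(x, 3), round(y, 3)))
    return origins


# Hypothetical start of the text line (an assumption, not read from the PDFs).
START = (104.836, 265.140)

abbyy = [("Tm", (1, 0, 0, 1, 143.236, 265.140))]
ghostscript = [("Td", (36.235, 0)), ("Td", (2.16501, 0))]

print(apply_ops(abbyy, START)[-1])
print(apply_ops(ghostscript, START)[-1])
```

Under that assumed start, both variants end at the same origin. The only difference is that the Ghostscript variant gets there in two relative steps, with the (r) shown in between.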

Note, however, that the problem does not occur when I use Acrobat Reader 9.4.2 on Linux and choose the menu action "File -> Save as Text...". In that case the saved text has no additional spaces (only a few extra line breaks). Searching and copy'n'paste are affected on Linux too, though: the text is not correctly searchable, and pasted text shows the extra spaces.


[*] I'll update here with the bug number when I've done it.


Update:

After pondering the replaced Tm operator a bit more, I now think it is not the root of the problem.

With that in mind, I ran the same conversion with Ghostscript v8.71 instead of v9.02. The result: the copy'n'paste problem does not occur with the v8.71 output!

That means there is a problem in Ghostscript 9.02 that wasn't there in 8.71. Most likely it has to do with the font metrics embedded in the output PDF, because the PDF snippets quoted above are the same in the v8.71 output as in the v9.02 output.

Update 2:

URL of bug entry in Ghostscript's bugzilla:

  • http://bugs.ghostscript.com/show_bug.cgi?id=692206

Update 3:

This bug seems to have been fixed in the meantime. I no longer see it with the Ghostscript versions I have re-tested: neither v9.06 nor current Git (v9.10GIT).


If you scan a page with text into a PDF and run an OCR application on it, then the text will be added to the page, but the "text rendering mode" is set to invisible. It's there, but it's not rendered on screen (or on paper if printed). What you see or print is the original scanned image.

How can we make the invisible text visible?

Well, we can edit the PDF... The PDF code to set text rendering to invisible is this:

3 Tr

You cannot find this string (yet) in the original from_abbyy.pdf or in after_ghostscript.pdf, because parts of the PDFs are compressed. So we uncompress them as far as possible with the help of qpdf:

qpdf \
 --qdf \
   from_abbyy.pdf \
   qdf--from_abbyy.pdf

qpdf \
 --qdf \
   after_ghostscript.pdf \
   qdf--after_ghostscript.pdf

Now we can find the above string easily (there is only one occurrence in each file).
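
If you would rather not hunt for the string in a text editor, a small Python sketch like the following can count the occurrences. The regular expression is my own choice: it treats 3 and Tr as separate tokens so that operands such as 13 or 0.3 are not matched. The sample bytes are invented for illustration, not copied from either file.

```python
import re

# "3" followed by the Tr operator, as stand-alone tokens.
INVISIBLE_TR = re.compile(rb"(?<![\w.])3\s+Tr(?![\w])")


def count_invisible_tr(pdf_bytes: bytes) -> int:
    """Count '3 Tr' (invisible text rendering mode) in uncompressed PDF data."""
    return len(INVISIBLE_TR.findall(pdf_bytes))


def count_in_file(path: str) -> int:
    """Same as count_invisible_tr, but reads the bytes from a QDF file."""
    with open(path, "rb") as fh:
        return count_invisible_tr(fh.read())


# Invented content stream fragment, just to show the call.
sample = b"BT\n/F1 10 Tf\n3 Tr\n(Der) Tj\nET\n"
print(count_invisible_tr(sample))  # prints 1
```

With the uncompressed files from above you would call count_in_file("qdf--from_abbyy.pdf") and count_in_file("qdf--after_ghostscript.pdf").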

Let's switch this to one of the visible modes of text rendering. Overall, we can choose amongst these 8 text rendering modes:

 0 -  fill glyph shapes
 1 -  stroke glyph shapes
 2 -  fill, then stroke glyph shapes
 3 -  neither fill nor stroke glyph shapes (invisible)
 4 -  fill and add to path for clipping glyph shapes
 5 -  stroke glyph shapes and add to path for clipping
 6 -  fill, then stroke glyph shapes and add path for clipping
 7 -  add glyph shapes to path for clipping

If I use the "fill" mode, the text from the OCR will probably not look good on top of the underlying scan image, because the filled glyphs cover it. Therefore I prefer the "stroke" variant. So I simply change the above line to read

 1 Tr

Looking at this modified PDF, I don't like it, because the default line width is too thick for my taste. Also, the color of the outline stroke is black (the default); I'd prefer red, to contrast with the originally scanned shapes. Therefore I add some code in front of this line that sets the line width to a quarter of a point:

 .25 w

and some other to set the stroke color to red:

 1 0 0 RG

The complete line now reads:

 .25 w 1 0 0 RG 1 Tr

That's all.

Note that our little manipulation has damaged the file: the new line is longer than the old one, so the byte offsets recorded in its "TOC" (in technical terms: its xref table) are no longer valid. Acrobat Reader or Acrobat Professional will nevertheless still open it (without even complaining) and silently "repair" the xref section of the file. Other PDF viewers may reject the file, but for now we don't care...
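
The manual edit can also be scripted. The following is a minimal Python sketch under my own assumptions: it works on the bytes of the uncompressed QDF file, replaces every stand-alone 3 Tr with the line built above, and makes no attempt to repair the xref table or stream lengths afterwards. The file names in the usage note are hypothetical.

```python
import re
from typing import Tuple

# "3" followed by the Tr operator, as stand-alone tokens.
INVISIBLE_TR = re.compile(rb"(?<![\w.])3\s+Tr(?![\w])")

# Quarter-point line width, red stroke color, stroke-only text rendering.
VISIBLE_OPS = b".25 w 1 0 0 RG 1 Tr"


def make_text_visible(pdf_bytes: bytes) -> Tuple[bytes, int]:
    """Return (modified bytes, number of replacements made)."""
    return INVISIBLE_TR.subn(VISIBLE_OPS, pdf_bytes)


def edit_file(src: str, dst: str) -> int:
    """Apply make_text_visible to the file src and write the result to dst."""
    with open(src, "rb") as fh:
        data, count = make_text_visible(fh.read())
    with open(dst, "wb") as fh:
        fh.write(data)
    return count
```

A call such as edit_file("qdf--from_abbyy.pdf", "visible--from_abbyy.pdf") should report one replacement for the files discussed here. The qpdf distribution also contains a helper named fix-qdf that is meant to repair hand-edited QDF files; I have not tried it on these particular files.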

Here are screenshots of the result. The first screenshot is zoomed to window width, the second is zoomed to 800%.

The red outlines are the OCR text, now made visible, just as we wanted.

I applied the procedure outlined above to both files, from_abbyy.pdf and after_ghostscript.pdf, and opened the two results in two separate instances of Acrobat Reader. If we set both to the same zoom value and maximize both windows, it is easy to toggle between the two files via [alt]+[tab]. This is a good way to reveal even the finest rendering differences between two PDF files.

My result is: there is not even a single pixel different between Ghostscript's (v9.02) input and its output for this file. But there is quite a difference if you want to copy'n'paste text...


I don't see the described problem. I opened the 'after' PDF file with Acrobat Professional 9.0 and the text is copied and pasted correctly.

Ghostscript fully interprets the PDF file and produces a new PDF file based on what it interpreted. The new file has no relationship with the original file other than that it records the position of the text.

Because of the rich feature set of PDF it is possible to have characters positioned in the same place by using multiple different methods. So there's nothing wrong or unexpected per se in the way that GS is producing the PDF file.

Given that the text can be saved out correctly, this is a matter of the Acrobat heuristics deciding whether or not two 'nearby' characters are adjacent or have a space between, when handled as consecutive ASCII.
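
Purely as an illustration of what such a heuristic could look like (this is not Acrobat's algorithm, which is not described here, and every number below is invented), here is a small Python sketch. It joins text runs and inserts a space whenever the gap between the estimated end of one run and the start of the next exceeds a threshold.

```python
def join_runs(runs, gap_threshold=0.3):
    """Concatenate text runs in reading order.

    runs: list of (text, x_start, x_end) tuples, coordinates in points.
    A space is inserted when the horizontal gap between the end of the
    previous run and the start of the next one exceeds gap_threshold.
    """
    out = []
    prev_end = None
    for text, x_start, x_end in runs:
        if prev_end is not None and x_start - prev_end > gap_threshold:
            out.append(" ")
        out.append(text)
        prev_end = x_end
    return "".join(out)


# Same glyph positions, but two different estimates of where the first run
# ends (for example because different glyph widths were assumed).
tight = [("Deutsche", 100.0, 136.2), ("r", 136.2, 139.0)]
loose = [("Deutsche", 100.0, 135.0), ("r", 136.2, 139.0)]

print(join_runs(tight))  # prints Deutscher
print(join_runs(loose))  # prints Deutsche r
```

The point of the sketch is only that the decision depends on estimates and thresholds, so two files that render identically can still extract differently.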

I don't believe the problem can be the embedded font metrics for the simple reason that the font is not embedded :-) The font being used is Helvetica, which is not embedded in the document, and so Acrobat (for me at least) uses ArialMT. Note that the 'original' PDF file also does not contain the fonts.

I will eventually look at the reported bug, but it won't be soon and I doubt there is anything we can (or will) do about it. It seems to me this is an inevitable consequence of heuristics. It might help to embed the fonts though, so that at least they would be consistent.


From the Ghostscript bug report at:

http://bugs.ghostscript.com/show_bug.cgi?id=692206


I have now been able to reproduce the issue, and it is not a regression from 8.71; it's a progression (and an Adobe change).

8.71 shipped with a bug which caused it to write invalid ToUnicode CMaps. Misleading and contradictory Adobe documentation led to the CMap being written as a CMap, when in fact ToUnicode CMaps have their own, incompatible, rules.

ToUnicode CMaps are normally only used for searching and copy/paste. As the name implies, they are used to map character codes to Unicode code points. The ToUnicode CMap in the 8.71 PDF file is not used because it is invalid; the one in later versions is valid, and Acrobat is known to use it.
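
To illustrate the mapping idea, here is a minimal Python sketch that reads the bfchar entries of a ToUnicode CMap, where each entry pairs a character code with a UTF-16BE destination string. It is deliberately incomplete (it ignores bfrange sections and code space declarations), and the sample fragment is invented rather than taken from either PDF.

```python
import re

BFCHAR_BLOCK = re.compile(rb"beginbfchar(.*?)endbfchar", re.S)
BFCHAR_ENTRY = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_bytes: bytes) -> dict:
    """Return {character code: Unicode string} for all bfchar entries."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_bytes):
        for src, dst in BFCHAR_ENTRY.findall(block):
            code = int(src.decode("ascii"), 16)
            # Destination strings are UTF-16BE encoded hex strings.
            text = bytes.fromhex(dst.decode("ascii")).decode("utf-16-be")
            mapping[code] = text
    return mapping


# Invented fragment: code 0x44 maps to "D", code 0xDF maps to "ß".
sample = b"""
2 beginbfchar
<44> <0044>
<DF> <00DF>
endbfchar
"""

print(parse_bfchar(sample))
```

A viewer that trusts such a table uses it, rather than the font's built-in encoding, to decide which Unicode text a shown character code stands for when searching or copying.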

It appears that in Acrobat Reader up to and including 9.2 the existence of the ToUnicode data makes no difference. At some point after 9.2 the search mechanism changed, and Acrobat appears to use two different mechanisms depending on whether a ToUnicode CMap is present. I don't have access to Acrobat Pro after 9.2 and only recently installed Reader X; I have nothing in between.

The 'no Unicode' method works on all versions of Acrobat, the 'Unicode' method fails on newer versions.

I showed this by white spacing the reference to the ToUnicode CMap from the FontDescriptor. If required I can make the various files available, but they are large as they are decompressed.

Since search is a heuristic effort in PDF it is not going to be possible to guarantee a result. The change in behaviour is due to Acrobat, not Ghostscript, and the change in Ghostscript was to fix a real bug, so a progression, not a regression.