Why does the text `fi` get cut when I copy from a PDF or print a document?

When I copy text from a PDF file in Adobe Reader that contains

Define an operation

I instead see

Dene an operation

when I paste the text. Why is this?

How can I remedy this annoying problem?

I've also seen this occur in the past when printing a Microsoft Office Word file to my printer.


Solution 1:

This sounds like a font issue. The PDF is probably using the OpenType fi ligature in the word define, and the current font of the destination application is missing that glyph.

I don't know if there's an easy way to get Acrobat to decompose the ligature on copy.
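
If the ligature does survive the copy as the single Unicode character U+FB01 (rather than being dropped outright, as in your "Dene" example), you can decompose it after pasting. A minimal Python sketch, assuming the pasted text is available as a string:

```python
import unicodedata

pasted = "De\ufb01ne an operation"  # U+FB01 is the single-character "fi" ligature

# NFKC compatibility normalization decomposes ligature characters such as
# U+FB01 (fi) and U+FB02 (fl) back into plain ASCII letters.
fixed = unicodedata.normalize("NFKC", pasted)
print(fixed)  # -> Define an operation
```

This only helps when the ligature character actually reaches the clipboard; if the glyph maps to nothing at all, there is nothing left to normalize.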

Your problems with printing are probably also font-related. Something is likely allowing the printer to substitute its own built-in fonts for the document's font, and the printer's version of the font is also missing that particular glyph. You'd have to tell Windows to always download fonts to the printer to work around this problem.

Another possibility when printing: UniScribe may not be enabled. MS KB 2642020 talks about this and some possible workarounds (namely, using RAW printing rather than EMF printing). Though the context is slightly different from your specific problem, the cause may be the same and the same workarounds may apply.

Solution 2:

The issue here is, as the other answer notes, with ligatures. However, it has nothing at all to do with OpenType. The fundamental problem is that PDF is a pre-print format that concerns itself very little with content and semantics; instead, it is geared towards faithfully representing a page as it would be printed.

Text is laid out not as text but as runs of glyphs from a font at certain positions. So you get something like »Place glyph number 72 there, glyph number 101 there, glyph number 108 there, ...«. At that level there is fundamentally no notion of text at all; it's just a description of how the page looks. There are two problems with extracting meaning from a bunch of glyphs:

  1. The spatial layout. Since the PDF already contains specific information about where to place each glyph, there is no actual text underlying it as there normally would be. Another side effect is that there are no spaces: sure, if you look at the rendered page there are, but not in the PDF. Why emit a blank glyph when you could just emit none at all? The result looks the same, after all. So PDF readers have to carefully piece the text back together, inserting a space whenever they encounter a larger gap between glyphs (see the sketch after this list).

  2. PDF renders glyphs, not text. Most of the time the glyph IDs in the embedded fonts correspond to Unicode code points or at least ASCII codes, which means you can often get ASCII or Latin-1 text back well enough, depending on who created the PDF in the first place (some tools garble everything in the process). But often even PDFs that let you get ASCII text out just fine will mangle everything that is not ASCII. This is especially horrible with complex scripts such as Arabic, which after the layout stage consist only of ligatures and alternate glyphs, so Arabic PDFs almost never contain actual text.
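
To illustrate the first point, here is a deliberately simplified Python sketch of the kind of heuristic an extractor has to apply: glyphs arrive only as (character, x-position) pairs with no space glyphs at all, and a word break is guessed whenever the horizontal gap to the previous glyph exceeds a threshold. The positions and the threshold are invented for the example.

```python
# Simplified model of text extraction from positioned glyphs.
# Each glyph is (character, x_position); there are no space glyphs at all.
glyphs = [
    ("D", 0.0), ("e", 7.2), ("f", 13.1), ("i", 17.0), ("n", 19.9), ("e", 26.8),
    ("a", 40.5), ("n", 47.4),            # larger gap before "a" -> word break
    ("o", 61.0), ("p", 68.2),            # larger gap before "o" -> word break
]

GAP_THRESHOLD = 9.0  # anything wider than a typical letter advance counts as a space

text = []
prev_x = None
for char, x in glyphs:
    if prev_x is not None and x - prev_x > GAP_THRESHOLD:
        text.append(" ")
    text.append(char)
    prev_x = x

print("".join(text))  # -> "Define an op"
```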

The second problem is the kind you are facing. A common culprit here is LaTeX, which utilises an estimated 238982375 different fonts (each of which is restricted to 256 glyphs) to achieve its output. Different fonts for normal text, for math (more than one of those), etc. make things very difficult, especially as Metafont predates Unicode by almost two decades, so there never was a Unicode mapping for them. Umlauts are also produced by superimposing a diaeresis on a letter, so you get »¨a« instead of »ä« when copying from the PDF (and of course you cannot search for it either).
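
When you do end up with such copies, the specific »¨a«-style damage can sometimes be repaired after the fact. A rough Python sketch, assuming the diaeresis came through as the standalone character U+00A8 placed in front of its base letter, as described above:

```python
import re
import unicodedata

copied = "eine sch¨one L¨osung"  # "eine schöne Lösung", garbled by copy & paste

# Move the stray spacing diaeresis (U+00A8) behind the vowel it belongs to,
# turn it into the combining mark U+0308, then recompose with NFC.
repaired = re.sub("\u00a8([aouAOU])", "\\1\u0308", copied)
repaired = unicodedata.normalize("NFC", repaired)
print(repaired)  # -> eine schöne Lösung
```

This is only a heuristic for one specific failure mode; it cannot recover anything when the accent glyph maps to no character at all.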

Applications producing PDFs can opt to include the actual text as metadata. If they don't, you're left at the mercy of how the embedded fonts are handled and whether the PDF reader can piece the original text back together. But »fi« being copied as a blank or as nothing at all is usually a sign of a LaTeX PDF. You should paint Unicode characters on stones and throw them at the producer, in the hope that they switch to XeLaTeX and thus finally arrive in the 1990s of character encodings and font standards.
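
If you want to see what a given PDF actually exposes as text before blaming a particular reader, you can dump its text layer programmatically and check whether words like »define« survive. A small sketch using the pdfminer.six library (the filename is just a placeholder):

```python
# pip install pdfminer.six
from pdfminer.high_level import extract_text

text = extract_text("document.pdf")  # placeholder filename

# If the producer embedded proper text information (ToUnicode maps or
# /ActualText entries), "define" comes back intact; otherwise you see the
# same "Dene"-style damage as when copying by hand.
print("define" in text, "Dene" in text)
```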