Cutting & Pasting Vietnamese characters from a PDF

Solution 1:

This happens because the encoding used in the PDF is arbitrary.

[Screenshot: Acrobat file properties, from a Vietnamese PDF I found on the internet]

"Encoding:Custom" probably means a seemingly random encoding that the program producing this PDF made up for its own convenience.

"Embedded Subset" means the program didn't need many characters from this font, so it picked just the few it needed and arranged them in seemingly random order (maybe the order in which it encountered them in the text). The newly invented encoding is based on this ordering.

They're not really "characters" any more. The PDF no longer carries any universally meaningful information about which character each shape stands for. It just has an indexed set of shapes and a list of positions and sizes at which to draw those indexed shapes.
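To make that concrete, here is a toy model in Python of how a subsetting program might number glyphs in the order it first meets them. It is only a sketch under my own assumptions: the function names are made up, the codes start at 1 for no particular reason, and a real PDF producer may order things differently. The point is that the stored codes mean nothing without the table that produced them (in a real PDF, an optional ToUnicode map plays that role when it is present).

```python
import unicodedata


def build_subset_encoding(text):
    """Give each distinct character a code in order of first appearance."""
    code_for_char = {}
    for ch in text:
        if ch not in code_for_char:
            code_for_char[ch] = len(code_for_char) + 1  # hypothetical: codes start at 1
    return code_for_char


def encode(text, code_for_char):
    """Replace every character with its subset code."""
    return [code_for_char[ch] for ch in text]


def naive_copy(codes):
    """Pretend the subset codes are ordinary character codes (no mapping available)."""
    return "".join(chr(c) for c in codes)


def decode(codes, code_for_char):
    """Recover the text using the reverse table, if one is available."""
    char_for_code = {code: ch for ch, code in code_for_char.items()}
    return "".join(char_for_code[c] for c in codes)


# Normalize so each accented letter is a single code point.
text = unicodedata.normalize("NFC", "Tiếng Việt")
table = build_subset_encoding(text)
codes = encode(text, table)

print(codes)                          # small integers, one per glyph
print(repr(naive_copy(codes)))        # control characters, not Vietnamese
print(decode(codes, table))           # readable again, thanks to the table
```

Copy and paste from such a PDF behaves like `naive_copy` whenever the viewer has no reverse table to consult.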


Wikipedia says:

CID-keyed fonts may be made without reference to a character collection by using an "identity" encoding, such as Identity-H (for horizontal writing) or Identity-V (for vertical). Such fonts may each have a unique character set, and in such cases the CID number of a glyph is not informative; generally the Unicode encoding is used instead, potentially with supplemental information.

So you might check whether the text makes sense when decoded as, say, UTF-16 BE.
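If you can get at the raw bytes of a text string from the PDF (how you extract them is outside this sketch, and I'm assuming you have them as a Python `bytes` object), a quick check might look like the following. The helper name `try_utf16_be` is my own. A successful decode does not prove the guess is right; you still have to look at the result and see whether it reads as Vietnamese.

```python
def try_utf16_be(raw):
    """Return the text decoded as UTF-16 BE, or None if the bytes are not valid."""
    try:
        return raw.decode("utf-16-be")
    except UnicodeDecodeError:
        return None


# Stand-in data: real input would be bytes pulled out of the PDF.
sample = "Việt".encode("utf-16-be")

print(try_utf16_be(sample))     # decodes cleanly
print(try_utf16_be(b"\x00"))    # odd number of bytes, so None
```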