I'm trying to copy text from a PDF, but I get garbage

I'm trying to copy text from a PDF file, but I get garbage. I'm using Document Reader on Ubuntu to read the document. It's not that it won't let me copy; the copied text just looks like this:

RFRPSLOHJFFDUSVQLIIHUFRDUSVQLIIOSFDS    

5XQDVURRW

LQFOXGHSFDSK!
LQFOXGHVWGOLEK!
LQFOXGHVWULQJK!

$53+HDGHUDVVXPLQJ(WKHUQHW,3Y

GH¿QH$53B5(48(67
$535HTXHVW

GH¿QH$53B5(3/<
$535HSO\

W\SHGHIVWUXFWDUSKGU^
XBLQWBWKW\SH
+DUGZDUH7\SH

XBLQWBWSW\SH
3URWRFRO7\SH

XBFKDUKOHQ
+DUGZDUH$GGUHVV/HQJWK

XBFKDUSOHQ
3URWRFRO$GGUHVV/HQJWK

XBLQWBWRSHU
2SHUDWLRQ&RGH

XBFKDUVKD>@
6HQGHUKDUGZDUHDGGUHVV

XBFKDUVSD>@
6HQGHU,3DGGUHVV

XBFKDUWKD>@
7DUJHWKDUGZDUHDGGUHVV

XBFKDUWSD>@
7DUJHW,3DGGUHVV

What can I do to fix this? It's a large amount of data and would take a really long time to type.

Also, incidentally, the pasted text looked different in gedit on my Ubuntu system than it does when pasted here in this question.

I sense it is somehow an encoding problem, but I have no way of knowing how to fix this.

Solution 1:

The underlying text is garbled. I think @skub is correct to think that it may be on purpose. One way to get the text would be to export each page as an image (e.g. .jpg or .png) and then scan the images with OCR software. I was able to test this on Windows 7 with Adobe Acrobat X; it worked.

Update:

In Acrobat, "Copy with Formatting" copies the text as expected, so if your document viewer has a similar feature, try that. Digging deeper, I can confirm that the embedded fonts all use a custom encoding.
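
As an illustration of what such a custom encoding can look like, the sample in the question appears to be a fixed shift: each copied character is 29 code points below the intended one (for example "LQFOXGH" reads as "include" once shifted). The sketch below rests on that assumption, which is inferred from this sample only and may not hold for the whole file. The names OFFSET, SPECIAL and decode are made up for the example, and the "¿" to "fi" mapping is a guess based on "GH¿QH" looking like "define". Characters that would land below the space character after the shift (spaces, digits, "#", "<", ".") seem to have been dropped during copying, so this cannot restore them; OCR or a formatted copy is still the more complete route.

```python
# Assumption: every copied character is the intended character's code point
# minus 29, as inferred from the sample (e.g. "LQFOXGH" -> "include").
OFFSET = 29

# Assumption: "¿" stands in for the "fi" ligature ("GH¿QH" -> "define").
SPECIAL = {"¿": "fi"}


def decode(garbled: str) -> str:
    """Shift each non-whitespace character forward by OFFSET code points."""
    out = []
    for ch in garbled:
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
        elif ch.isspace():
            out.append(ch)  # keep line breaks and spacing as they are
        else:
            out.append(chr(ord(ch) + OFFSET))
    return "".join(out)


if __name__ == "__main__":
    # Three lines taken from the garbled sample in the question.
    for line in ["LQFOXGHSFDSK!", "5XQDVURRW", "GH¿QH$53B5(48(67"]:
        print(decode(line))
```

Run on those three lines, this gives "includepcaph>", "Runasroot" and "defineARP_REQUEST", which is readable but still missing the dropped characters, so the result would need manual cleanup.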
