I'm trying to copy text from a PDF, but I get garbage
I'm trying to copy text from a PDF file, but I get garbage. I'm using Document Reader on Ubuntu to read the document. It's not that it won't let me copy; the copied text just looks like this:
RFRPSLOHJFFDUSVQLIIHUFRDUSVQLIIOSFDS
5XQDVURRW
LQFOXGHSFDSK!
LQFOXGHVWGOLEK!
LQFOXGHVWULQJK!
$53+HDGHUDVVXPLQJ(WKHUQHW,3Y
GH¿QH$53B5(48(67
$535HTXHVW
GH¿QH$53B5(3/<
$535HSO\
W\SHGHIVWUXFWDUSKGU^
XBLQWBWKW\SH
+DUGZDUH7\SH
XBLQWBWSW\SH
3URWRFRO7\SH
XBFKDUKOHQ
+DUGZDUH$GGUHVV/HQJWK
XBFKDUSOHQ
3URWRFRO$GGUHVV/HQJWK
XBLQWBWRSHU
2SHUDWLRQ&RGH
XBFKDUVKD>@
6HQGHUKDUGZDUHDGGUHVV
XBFKDUVSD>@
6HQGHU,3DGGUHVV
XBFKDUWKD>@
7DUJHWKDUGZDUHDGGUHVV
XBFKDUWSD>@
7DUJHW,3DGGUHVV
What can I do to fix this? It's a large amount of data, and it would take a really long time to type by hand.
Also, incidentally, the pasted text looked like this on gedit (Ubuntu):
(notice that it looks different when pasted here in this question!)
I sense it is somehow an encoding problem, but I have no way of knowing how to fix this.
Solution 1:
The underlying text is garbled. I think @skub is correct to think that it may be on purpose. One way to get the text would be to export each page as an image (e.g. .jpg or .png) and then scan the images with OCR software. I was able to test this on Windows 7 with Adobe Acrobat X; it worked.
Update:
If your document viewer has a similar feature, copy with formatting copies the text as expected. Digging deeper, I can confirm that the embedded fonts all have a custom encoding.
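If neither OCR nor a copy-with-formatting feature is available, the sample in the question suggests another route. This is only a sketch and it rests on an assumption: that the custom encoding is a constant offset. The pasted text is consistent with every character code being shifted down by 29 (for example, LQFOXGH becomes include when 29 is added back). The `decode` function and the `OFFSET` constant are my own names for illustration, not something read out of the PDF's font tables, and other documents may well use a different mapping.

```python
# Sketch: undo a constant character-code offset in text copied from a PDF.
# ASSUMPTION: the offset of 29 is inferred from the sample in the question,
# not taken from the PDF's embedded font encoding.
OFFSET = 29


def decode(garbled: str, offset: int = OFFSET) -> str:
    """Shift each printable ASCII character up by `offset` code points."""
    out = []
    for ch in garbled:
        code = ord(ch)
        # Only shift non-space printable ASCII whose result is still
        # printable ASCII; newlines, spaces and anything else pass through.
        if code >= 33 and code + offset <= 126:
            out.append(chr(code + offset))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for line in ["5XQDVURRW", "LQFOXGHSFDSK!", "W\\SHGHIVWUXFWDUSKGU^"]:
        print(decode(line))
```

Under the same assumption, characters whose shifted-down code falls below the printable range (spaces, digits, `#`, `.`, `<` and so on) appear to have been lost during copying, so the output will still need those filled in by hand. Characters outside the range, such as the ¿ in the sample (probably a ligature), are left untouched.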