How can I copy this quote from PDF? [duplicate]
Possible Duplicate:
PDF has garbled text when copy pasting
I'm reading a PDF copy of Jerome H. Friedman's paper "Data Mining and Statistics: What's the Connection?" using Google Chrome.
It contains an amusing quote that I want to copy and paste to my blog.
I used the mouse to select the text of the quote and pressed CTRL + C to copy the text. The document looks like this:
When I paste the text into Notepad, Stack Overflow, or anywhere else, the product is Wingdings-like gibberish:
➣✍❺❼⑤➭✸❸❊➁❥❸❊⑥▼❽❾❸✘➎✳❸❾②❘➊➥❸❊❸❊⑥❦⑨❘②③✇▲➆ ②❥⑤⑩⑨❘②❥⑤⑩❽❾⑤⑩✇➄⑥▼⑨❏✇➄⑥▼❺➌❽❾❻➀➍♣➂⑦❶❼②❥❸❊➁❷⑨❥❽❾⑤❸❊⑥✗②❥⑤⑩⑨❘②③⑨✘⑤⑥☎②❥➇⑦⑤⑩⑨ ➔❸❊➅⑩❺➌⑨❹❸❊❸❊➍P⑨①②❥❻ ➎✳❸❏②❥➇▼✇▲②➟➊❚➇⑦❸❊⑥✆✇P⑨❘②③✇▲②❥⑤⑩⑨❘②❥⑤⑩❽❾⑤⑩✇➄⑥❦➇▼✇➀⑨↔✇➄⑥❦⑤⑩❺❼❸✶✇♣➇⑦❸❷❻➀➁↔⑨❹➇⑦❸❷➊❚➁❥⑤②❥❸✶⑨ ✇❨➂▼✇➄➂✳❸❊➁✶Þ⑦✇♣❽❾❻➀➍♣➂⑦❶❼②❥❸❊➁➟⑨❥❽❾⑤❸❊⑥✗②❥⑤⑩⑨❘②↔⑨❘②③✇➄➁❹②③⑨❚✇♣❽❾❻➀➍♣➂▼✇➄⑥☛➧➀➏
The text should instead look like this:
A difference between statisticians and computer scientists in this field seems to be that when a statistician has an idea he or she writes a paper; a computer scientist starts a company.
I had to type that text out manually. This is feasible for such a small quote, but how do I actually copy what I see?
Is it something unusual about the PDF, the browser, the plugin, or some combination of the three?
Solution 1:
The most reliable way to do this is with OCR.
As a quick and dirty alternative, you can open Google Quick View from the search result for your link, then choose View > Plain HTML.
The HTML version still contains some garbled text and is hard to read, but a large part of the text is correct and copyable. Search works there, so you can use it to locate the target passage and copy it without any garbled text.
Detailed example: open Quick View from the search result, then choose View > Plain HTML.
On Google's HTML version, you can search and select the equivalent text like this:
Pasting into Notepad produces this output:
A difference between sta-tisticians and computer scientists in this field seems tobe that when a statistician has an idea he or she writesa paper; a computer scientist starts a company.
Not exactly as displayed, but close enough that you can work with it.
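If you want to tidy up the pasted output, the stray hyphen in "sta-tisticians" can be repaired automatically. Here is a minimal Python sketch, assuming you supply a vocabulary of words you expect in the text; the function name `dehyphenate` and the vocabulary are hypothetical, chosen only for illustration. It does not fix the missing spaces ("tobe", "writesa"), which still need a manual pass.

```python
import re


def dehyphenate(text, vocabulary):
    """Join fragments like 'sta-tisticians' when the joined word is known.

    `vocabulary` is an assumed, caller-supplied set of lowercase words.
    Hyphenated pairs whose joined form is not in it are left untouched,
    so genuine compounds such as 'well-known' survive.
    """
    def join_if_known(match):
        joined = match.group(1) + match.group(2)
        return joined if joined.lower() in vocabulary else match.group(0)

    # Match two runs of letters separated by a single hyphen.
    return re.sub(r"\b([A-Za-z]+)-([A-Za-z]+)\b", join_if_known, text)


pasted = "A difference between sta-tisticians and computer scientists"
print(dehyphenate(pasted, {"statisticians"}))
```

Checking against a vocabulary rather than deleting every hyphen is deliberate: a blanket removal would also damage words that are meant to be hyphenated.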
Solution 2:
You'll have to discard the corrupted text that's already associated with the PDF before you can re-OCR it. The easiest way to do that is to save the PDF in TIFF format, open the TIFF with Acrobat, and run OCR on it. That worked for me.
Solution 3:
This looks like a PDF with incorrect encoding. See the following threads:
Copy text from a PDF to word. Just get Symbols
PDF has garbled text when copy pasting
search PDFs with non-standard character encodings
Try printing the PDF using CutePDF, then see if the resulting PDF is any better.
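To check whether a given PDF (or the reprinted one) still has the problem, you can paste a sample of the copied text into a small script and measure how much of it lands in symbol blocks. This is a minimal Python sketch: the function names, the choice of Unicode blocks, and the 0.5 threshold are illustrative assumptions based on the garbled sample in the question, not a feature of any tool mentioned in this thread.

```python
# Unicode blocks that dominate the garbled sample in the question.
SYMBOL_RANGES = (
    (0x2460, 0x24FF),  # Enclosed Alphanumerics, e.g. circled digits
    (0x25A0, 0x25FF),  # Geometric Shapes, e.g. triangles
    (0x2700, 0x27BF),  # Dingbats
)


def symbol_ratio(text):
    """Return the fraction of non-whitespace characters in SYMBOL_RANGES."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for c in chars
        if any(lo <= ord(c) <= hi for lo, hi in SYMBOL_RANGES)
    )
    return hits / len(chars)


def looks_garbled(text, threshold=0.5):
    """Heuristic: treat text as garbled if most of it is symbols."""
    return symbol_ratio(text) >= threshold


print(looks_garbled("➣✍❺❼⑤➭✸❸❊➁"))
print(looks_garbled("A difference between statisticians"))
```

If the first few lines copied from a PDF come back as garbled, the embedded text is probably unusable and OCR is the safer route.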