How can I copy this quote from PDF? [duplicate]

Possible Duplicate:
PDF has garbled text when copy pasting

I'm reading a PDF copy of Jerome H. Friedman's paper "Data Mining and Statistics: What's the Connection?" using Google Chrome.

It contains an amusing quote that I want to copy and paste to my blog.

I used the mouse to select the text of the quote and pressed CTRL + C to copy the text. The document looks like this:

[Screenshot: the quote highlighted in Friedman's paper.]

When I paste the text into Notepad, Stack Overflow, or anywhere else, the product is Wingdings-like gibberish:

➣✍❺❼⑤➭✸❸❊➁❥❸❊⑥▼❽❾❸✘➎✳❸❾②❘➊➥❸❊❸❊⑥❦⑨❘②③✇▲➆ ②❥⑤⑩⑨❘②❥⑤⑩❽❾⑤⑩✇➄⑥▼⑨❏✇➄⑥▼❺➌❽❾❻➀➍♣➂⑦❶❼②❥❸❊➁❷⑨❥❽❾⑤❸❊⑥✗②❥⑤⑩⑨❘②③⑨✘⑤⑥☎②❥➇⑦⑤⑩⑨ ➔❸❊➅⑩❺➌⑨❹❸❊❸❊➍P⑨①②❥❻ ➎✳❸❏②❥➇▼✇▲②➟➊❚➇⑦❸❊⑥✆✇P⑨❘②③✇▲②❥⑤⑩⑨❘②❥⑤⑩❽❾⑤⑩✇➄⑥❦➇▼✇➀⑨↔✇➄⑥❦⑤⑩❺❼❸✶✇♣➇⑦❸❷❻➀➁↔⑨❹➇⑦❸❷➊❚➁❥⑤②❥❸✶⑨ ✇❨➂▼✇➄➂✳❸❊➁✶Þ⑦✇♣❽❾❻➀➍♣➂⑦❶❼②❥❸❊➁➟⑨❥❽❾⑤❸❊⑥✗②❥⑤⑩⑨❘②↔⑨❘②③✇➄➁❹②③⑨❚✇♣❽❾❻➀➍♣➂▼✇➄⑥☛➧➀➏

The text should instead look like this:

A difference between statisticians and computer scientists in this field seems to be that when a statistician has an idea he or she writes a paper; a computer scientist starts a company.

I had to type that text out manually. This is feasible for such a small quote, but how do I actually copy what I see?

Is it something unusual about the PDF, the browser, the plugin, or some combination of the three?


Solution 1:

The most reliable way to do this is to use OCR.

But as a quick and dirty solution, you can use Google Quick View from the search result for your link. In Quick View, choose View > Plain HTML.

The result still contains some garbled text and is quite unreadable, but a large amount of the text is correct and copyable. Search works there, so you can use it to locate the target text and copy it without any garbling.


Detailed example:

The Google search results for the URL include a Quick View link. In Quick View, choose View > Plain HTML to display the document as HTML. In Google's HTML version, you can search for and select the equivalent text. Pasting it into Notepad produces this output:

A difference between sta-tisticians and computer scientists in this field seems tobe that when a statistician has an idea he or she writesa paper; a computer scientist starts a company.

Not exactly as displayed, but close enough that you can work with it.
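
If you want to tidy up the leftover line-break hyphens (like "sta-tisticians") without typing the quote again, here is a rough Python sketch. The function name and the word list are hypothetical, made up for illustration. It only joins a hyphenated token when the joined form is in a word list you supply, so real compounds are left alone. It does not fix the missing spaces ("tobe", "writesa"); those still need a manual pass.

```python
import re


def join_split_words(text, known_words):
    """Drop a hyphen inside a token only if the joined form is a known word.

    known_words is a set of lowercase words supplied by the caller
    (an assumption for this sketch; any word list would do).
    """
    def repl(match):
        joined = match.group(1) + match.group(2)
        # Keep the original token unless the joined form is recognised.
        return joined if joined.lower() in known_words else match.group(0)

    return re.sub(r"\b([A-Za-z]+)-([A-Za-z]+)\b", repl, text)


if __name__ == "__main__":
    pasted = "A difference between sta-tisticians and computer scientists"
    print(join_split_words(pasted, {"statisticians"}))
```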

Solution 2:

You'll have to discard the corrupted text that's already associated with the PDF before you can re-OCR it. The easiest way to do that is to save it in TIFF format, which keeps only the page images and drops the text layer, then open the TIFF with Acrobat and run OCR on it. When I did that, it worked for me.

Solution 3:

Looks like a PDF with incorrect encoding. See the following threads:

  • Copy text from a PDF to word. Just get Symbols

  • PDF has garbled text when copy pasting

  • search PDFs with non-standard character encodings

Try printing the PDF using CutePDF, then see if the resulting PDF is any better.
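
To judge whether the text copied from the reprinted PDF "is any better", a quick heuristic is to check how much of it consists of actual letters. This is only a sketch: the function names and the 0.5 threshold are my own assumptions, not anything from a PDF tool. Text copied from a badly encoded PDF tends to come out as dingbats and circled digits, which Python does not count as alphabetic.

```python
def letter_ratio(text):
    """Fraction of non-whitespace characters that are alphabetic."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 1.0  # nothing to judge, so treat empty input as fine
    return sum(ch.isalpha() for ch in visible) / len(visible)


def looks_garbled(text, threshold=0.5):
    """Heuristic: mostly non-letters suggests a broken glyph-to-text mapping.

    The threshold is an arbitrary assumption; adjust it for your documents.
    """
    return letter_ratio(text) < threshold


if __name__ == "__main__":
    print(looks_garbled("➣✍❺❼⑤➭✸❸❊➁❥❸❊⑥▼❽❾❸✘➎✳"))
    print(looks_garbled("A difference between statisticians and computer scientists"))
```

Paste a sample of the copied text into the call; if it comes back as garbled, the text layer is still wrong and OCR is probably the way to go.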