Unsearchable, uncopiable PDF document


This PDF probably embeds its own fonts. In that case the PDF still displays correctly, but the text information behind the glyphs is not always available, so searching and copying can become impossible.

The fonts are in fact all embedded, but in such a way that all encoding information has been removed. The PDF is still syntactically fully compliant with the PDF spec, but important information about the meaning of its text was thrown away while the PDF was being made. It is very difficult to recover the encoding info, and sometimes the best option is to convert the pages to TIFF and then run OCR ...
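A toy model may make this easier to picture. The sketch below is not a PDF parser and says nothing about how this particular file was built. It only mimics the general idea: the page stores character codes, the embedded font knows how to draw each code, and a separate code-to-Unicode table is what search and copy rely on. The names PAGE_CODES, GLYPHS and TO_UNICODE, and the fallback guess used when the table is missing, are all made up for illustration.

```python
# Toy model only: real PDFs are far more involved than this.
# The page content stores arbitrary character codes, not letters.
PAGE_CODES = [22, 12, 5, 5, 1]

# The embedded font knows how to DRAW each code, so display still works.
GLYPHS = {22: "h", 12: "e", 5: "l", 1: "o"}

# The code-to-Unicode table is what search and copy depend on.
TO_UNICODE = {22: "h", 12: "e", 5: "l", 1: "o"}


def render(codes):
    """What the reader sees: glyph shapes drawn by the embedded font."""
    return "".join(GLYPHS[c] for c in codes)


def extract_text(codes, to_unicode):
    """What search and copy see: codes translated through the table.

    Without a table the viewer can only guess. This hypothetical guess
    treats each code as an offset from 'a', which yields garbage.
    """
    if to_unicode:
        return "".join(to_unicode[c] for c in codes)
    return "".join(chr(ord("a") + c) for c in codes)


intact = extract_text(PAGE_CODES, TO_UNICODE)
stripped = extract_text(PAGE_CODES, None)

print(render(PAGE_CODES))  # hello  (the page still displays correctly)
print(intact)              # hello  (searchable)
print(stripped)            # wmffb  (searching for "hello" finds nothing)
```

In this model the page looks identical in both cases, which is why the problem only shows up when you search or copy.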

You can try a PDF to Word converter, such as AnyBizSoft or a website converter. After conversion, you can get whatever you want from the Word or text file. Here is a step-by-step tutorial for AnyBizSoft. (AnyBizSoft is recommended by many, but I have never used it personally.)

See also Best Free PDF Tools for more tools and converters.

With Adobe Acrobat Pro 9, I process a PDF that is unsearchable because of custom font encoding as follows. Commands 1-4 are all in the File menu:

1) Open the pdf
2) Export -> image -> jpg
3) Create PDF -> batch create multiple files
4) Combine -> Merge files into single PDF

From the Document menu:

scan OCR (this makes the image-only pdf searchable).

But after I converted a 258-page pdf of 1457 KB using the above steps, it became a pdf of 67565 KB. It is far bigger, but it is now searchable.
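To put numbers on that growth, here is a small calculation using only the figures quoted above. Each page is now stored as a full image instead of a few kilobytes of text and font references, so a jump of this size is not surprising.

```python
# Figures taken from the answer above.
pages = 258
size_before_kb = 1457
size_after_kb = 67565

growth = size_after_kb / size_before_kb
per_page_before = size_before_kb / pages
per_page_after = size_after_kb / pages

print(f"growth factor: {growth:.1f}x")
print(f"per page: {per_page_before:.1f} KB -> {per_page_after:.1f} KB")
```

That works out to roughly a 46-fold increase, or about 262 KB per page instead of under 6 KB.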

Best solution yet! If you don't mind a little loss of quality...

1) Print your unsearchable pages to PDF using a rasterizing third-party PDF printer (Win2PDF worked for me). The end result is essentially a scan of the original PDF, stripped of font data.

2) Run the Optical Character Recognition (OCR) text recognition tool from the Document (top bar) menu.

The end result is a searchable (albeit scruffy-looking) PDF. Something about OCR roughs up the characters in the file. But they are indeed searchable and copy/pasteable.

Enjoy.
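If you don't have Win2PDF or Acrobat, the same rasterize-then-OCR idea can be sketched with free command-line tools. Note that this swaps in different software from the answer above: it assumes Poppler's pdftoppm and the tesseract OCR program are installed and on your PATH, and the function names and the "page" file prefix are my own choices. Treat it as a rough sketch, not a tested recipe for your particular file.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders each page to a PNG image."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path):
    """Build the tesseract command that writes a searchable PDF per image."""
    out_base = Path(image_path).with_suffix("")  # tesseract appends ".pdf"
    return ["tesseract", str(image_path), str(out_base), "pdf"]


def rasterize_and_ocr(pdf_path, work_dir):
    """Render every page to an image, then OCR each image into its own PDF."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf_path, work_dir / "page"), check=True)
    # pdftoppm zero-pads page numbers, so a plain sort should keep page
    # order, but it is worth checking the file list by eye.
    for image in sorted(work_dir.glob("page-*.png")):
        subprocess.run(ocr_cmd(image), check=True)
    return sorted(work_dir.glob("page-*.pdf"))
```

Calling rasterize_and_ocr("chemistryBook.pdf", "work") would leave one small searchable PDF per page in the work folder, which you still have to merge into a single file. A higher dpi value should give cleaner OCR at the cost of bigger files.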

I would like to contribute step by step instructions. The answers above did help me with the exact same problem, but there are a lot of steps missing. One thing that tripped me up (for weeks!) was finding a match on the symptoms. So, for newbies like me, I posted instructions on my own web site at http://supersaturated.com/howToFixUnsearchablePDF.html, which I copy here:

THE SYMPTOMS:

I recently bought an e-book from someone who had used an older Mac OS to create them. The books opened just fine. I could see the words in them. But I could not search for words in the book. All the programs that I used to do this (Windows Explorer, Foxit Reader, Adobe Acrobat, LibreOffice, various web browsers, Evernote Premium) either told me that the word was not found, or just stared at me blankly as though I hadn't just told them to search. The only search query that got a response was a search for a single letter or digit. However, I never found the letter or digit that I searched for; instead, I got a series of other characters one after another. For example, if I searched for the letter 'h', I would get, in succession: w, w, w, w, w, m, m, m, m, 2, 2, m, m, m, f, f, f, f, etc. After maybe 30 times of hitting Search Again, whatever program I was using seemed to get bored with the game, because it would then take me back to the top of the document and start finding instances of 'w' again. My boyfriend opened the document using his Mac and his Linux box, and he couldn't search it either.

Another symptom was that the text was uncopiable. I tried copying and pasting the text into various editors, but all that gave me was code.

I had no experience with manipulating pdf's and didn't know that, as a Windows 7 user, I owned the software to do it. As I scoured the web for the solution, I came across many almost-unintelligible (to me) explanations of the problem and what to do about it. In general, I found plenty of explanations of why there was a problem, but forum discussions typically ended with the problem going unsolved. The basic gist I got is that there is a very kludgy work-around using Adobe Acrobat. This is a program I never use, because I've always hated it (and pdf's). I thought it was just a reader anyway, and a terribly awkward one at that.

So last night I got to know Adobe Acrobat. I had no idea what most of the menu items did, so I just tried everything and failed until something worked.

ONE SOLUTION:

To save you from the same grief, here are step-by-step instructions. There may be other solutions; this is simply the first one that I found that I could do myself, without paying a web service or Kinkos to do it for twice what I paid for the e-book. If you don't happen to have Adobe Acrobat, you almost certainly have a friend who does. And there may be other pdf manipulators that can do the same thing (I looked hard but couldn't find a way to do it with Foxit or with Evernote, even though Evernote can read text from snapshots of your handwriting!).

LAUNCH Adobe Acrobat
Using the File menu, OPEN the corrupted document. (I don't know what to do if you're not even able to open the file. Sorry!)
(VERIFY that Acrobat can't search the document, in case you haven't done so, just to avoid unnecessary work.)
EXPORT: Once the document is open, use the File menu again, and choose EXPORT / IMAGE / PNG. Your corrupted pdf will be saved as a series of images with the file extension ".png", one for each page of the pdf document. Don't worry, they will be numbered automatically by Acrobat, and they aren't terribly large. My document was 200 pages long, so I got 200 little image files in .png format. The export may take a couple of minutes. You won't get any further signals from Adobe to tell you it's done--just go look in the directory that contains the original and see if it made png files with names like:

chemistryBook_Page_001.png chemistryBook_Page_002.png

COLLECT: Once you have the image files, collect them all by cut and paste into their own directory.
OCR: Under the Document menu, choose OCR TEXT RECOGNITION / RECOGNIZE TEXT IN MULTIPLE FILES USING OCR
ADD FILES: You will be shown a dialog box with the title "Paper Capture Multiple Files" with the subtitle "Run OCR on a set of images". There is a button that says "Add Files". Click this button, choose ADD FOLDERS, and browse to the folder that contains your png files. Highlight that folder, click OK. The files will then appear within this dialog box. Make sure that the files are in the proper order, or you will be sad. Click OK.
CHOOSE OUTPUT OPTIONS: Now you will get a dialog box entitled "Output Options". You have several choices to make here:

TARGET FOLDER: Click "Specific Folder", then Browse to your folder full of images, click "Make New Folder", name the folder (something like "CHEMISTRYBOOKIMAGEFILES" so you can find it easily and know what is in it), click OK.

FILE NAMING: Click "Keep Original File Names". This will preserve Acrobat's automatic numbering of your files--you will need that to get the page ordering right! UNcheck "Overwrite existing files" just to avoid a terrible mishap, unless you are very pressed for disc space or unless this is your 5th time attempting to follow these instructions and you've already got way too many duplicates of the output files. If you have the disc space, just make a new empty folder for your 6th try.

OUTPUT FORMAT: Select "Save File(s) as Adobe PDF". Click OK.

Now wait for Adobe to execute optical character recognition on the image files. Its output will be one little pdf file for each little image file that it OCR's.

COLLATE THE FILES INTO ONE: Under the File menu, select COMBINE / MERGE FILES INTO A SINGLE PDF. This step is optional; maybe you wanted a bunch of little files, or maybe you wanted to divide your enormous original document into 2 or 3 more manageable documents. To divide the file, just make a separate directory for the png files you want in each smaller final document, and repeat the steps from OCR through COLLATE for each directory. BE CAREFUL WITH NAMING! Make sure you choose a unique name, because if you got something wrong, you will want to be able to go back to your original corrupted pdf and try again. If your original is named "CHEMISTRY.PDF", please remember to name this new file something like "CHEMISTRY-FIXED.PDF".
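The ADD FILES step warns that the files must be in the proper order. If you would rather check that with a script than by eye, here is a minimal sketch. It assumes the exported images follow the naming shown earlier (something_Page_001.png); the function names are my own, and you would need to adjust the pattern if your copy of Acrobat names files differently.

```python
import re

# Matches the page number in names like "chemistryBook_Page_001.png".
PAGE_PATTERN = re.compile(r"_Page_(\d+)\.png$", re.IGNORECASE)


def page_number(filename):
    """Return the page number embedded in an exported image name."""
    match = PAGE_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"no page number in {filename!r}")
    return int(match.group(1))


def in_page_order(filenames):
    """Sort names by page number, so page 10 lands after page 9."""
    return sorted(filenames, key=page_number)


def missing_pages(filenames):
    """List page numbers absent between the first and last exported page."""
    present = {page_number(name) for name in filenames}
    return [n for n in range(min(present), max(present) + 1) if n not in present]


names = [
    "chemistryBook_Page_010.png",
    "chemistryBook_Page_002.png",
    "chemistryBook_Page_001.png",
]
print(in_page_order(names))
print(missing_pages(names))  # [3, 4, 5, 6, 7, 8, 9]
```

An empty list from missing_pages means no page was skipped during export, which is worth knowing before you merge 200 little files.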

If you really despise pdf, you can try using different output formats in the CHOOSE OUTPUT OPTIONS step. I do hate pdf, but I chose pdf for two reasons: one is that I had more confidence that pdf would retain important features like charts and graphs and labeled photos in my document. The other is that I was so so so so SO tired of doing all this pdf crap instead of the chemistry work that I'd gotten the ebook to help me with that I didn't want to do anything fancy with file formats at this point. Let me know if you try output to rtf or ascii and get good results.

TEST: Open the merged document(s) in all of the pdf readers and web browsers you will want to use with it, and try searching with it. Use your file browser and try to search for text in the directory with a word you know the file contains. Searchable? Good job, you're done, cheers!

Not searchable? Oh noes! Check that you opened the correct document (maybe you opened the original by mistake). Try the entire process again. If that doesn't help, try the entire process again, but output to plain text this time. My apologies, but, being a complete newbie myself, I have no further advice on this topic.
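The TEST step can also be done from a script. The sketch below assumes Poppler's pdftotext command is installed (it is not part of Acrobat, and the function names are mine). It pulls the text layer out of the fixed pdf and reports which of the words you know are in the book actually turn up.

```python
import subprocess


def extract_text(pdf_path):
    """Return the text layer of a pdf via pdftotext (must be installed)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_known_words(text, known_words):
    """Report, for each word you expect in the book, whether it appears."""
    lowered = text.lower()
    return {word: word.lower() in lowered for word in known_words}


# Stand-in text so the check can be tried without a pdf at hand.
sample = "Chapter 1: Atoms and Molecules"
print(find_known_words(sample, ["molecules", "entropy"]))
```

With a real file you would call find_known_words(extract_text("CHEMISTRY-FIXED.PDF"), [...]) using a few words you can see on the page. Keep in mind that OCR misreads some words, so one miss does not mean the whole process failed.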

NB! My output PDF is of rather low quality. It looks like it was literally scanned from a 10th-iteration paper copy. I don't know how to fix that, either after the fact or somewhere in the above process. It's good enough, so I'm just dealing with the shaky blurriness. I seem to remember somewhere that I could choose a high quality output, but, again, I didn't want to do anything fancy with vectors and rasterizing and layers and other terms that I don't know before I verified that I could do something basic and get back to chemistry asap.

My blog is not open to public comments. If you have questions, email me. My address is carolyn at my domain name. I will do my best to help you because I know how frustrating and crippling this problem can be, and I know how daunting this whole pretend-ocr process is.
