PDF: detect and crop multiple pages?
I used a high-speed scanner at my university to scan some sections of a book into a PDF. The PDF file produced by the scanner is simply the images taken by the document camera, stored as-is. Each image shows two facing pages, so we have 30 PDF pages that represent 60 print pages.
The machine is capable of large-scale scans, so its scanning area is much larger than a normal book. This means the images also have a LOT of border. The table is black and the pages are obviously white, so it seems software should be able to crop them automatically.
I'm looking for a solution that can go through the PDF, split each image into its two pages, remove the border around them, and produce a new PDF from the results. In other words, I want a PDF of 60 pages with the borders removed. I plan to pass the processed PDF through ABBYY FineReader for OCR.
Does anyone have any ideas as to how this can be done?
These free tools look promising for your purposes: Scantailor or Bookscanner.
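If neither tool fits, the automatic crop described in the question (white pages on a black table) can be approximated with a simple brightness threshold. This is only a rough sketch, not how either tool works: it assumes each page image has already been rendered to a grayscale grid (a list of rows of 0-255 values), and the function name `content_bbox` and the parameters `threshold` and `min_fraction` are made up for illustration.

```python
def content_bbox(gray, threshold=128, min_fraction=0.2):
    """Return (left, top, right, bottom) of the bright region in a grayscale grid.

    gray is a list of rows, each a list of 0-255 brightness values.
    right and bottom are exclusive. Returns None if nothing is bright enough.
    """
    height = len(gray)
    width = len(gray[0])

    # Count bright pixels in every row and every column.
    row_counts = [sum(1 for v in row if v >= threshold) for row in gray]
    col_counts = [sum(1 for row in gray if row[x] >= threshold)
                  for x in range(width)]

    # Keep only rows/columns with enough bright pixels, so isolated
    # specks of dust or glare on the black table are ignored.
    rows = [y for y, c in enumerate(row_counts) if c >= min_fraction * width]
    cols = [x for x, c in enumerate(col_counts) if c >= min_fraction * height]
    if not rows or not cols:
        return None
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


# Tiny synthetic example: a 20x10 black image with a white block inside
# and one stray bright pixel in the corner.
image = [[0] * 20 for _ in range(10)]
for y in range(2, 8):
    for x in range(4, 16):
        image[y][x] = 255
image[0][0] = 255  # stray speck, should not affect the box

box = content_bbox(image)
```

Real scans would probably need the threshold tuned, and a skewed book would need deskewing first, which this sketch does not attempt.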
If you have access to Adobe Acrobat, you can do it there; that's how I've done it. The basic workflow is to combine the images into a PDF, crop the extra black space from all the pages at once, duplicate each of the pages, crop the even and odd ones in two batches so that each keeps one half of the spread, and then OCR.
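The duplicate-and-crop step comes down to a bit of box arithmetic, sketched here outside of Acrobat. The assumptions are mine: crop boxes are plain `(left, top, right, bottom)` tuples, the gutter sits at the horizontal center of the cropped spread, and the book reads left to right. The helper names `split_spread` and `pages_in_reading_order` are hypothetical.

```python
def split_spread(box):
    """Split one spread's crop box into left-page and right-page boxes."""
    left, top, right, bottom = box
    mid = (left + right) // 2  # assumed gutter position: horizontal center
    return (left, top, mid, bottom), (mid, top, right, bottom)


def pages_in_reading_order(spread_boxes):
    """Return (source_index, crop_box) pairs, two per scanned spread.

    This mirrors duplicating every page and cropping the first copy to
    the left half and the second copy to the right half.
    """
    pages = []
    for index, box in enumerate(spread_boxes):
        left_page, right_page = split_spread(box)
        pages.append((index, left_page))
        pages.append((index, right_page))
    return pages


# 30 scanned spreads, all already cropped to the same box, become 60 pages.
spreads = [(4, 2, 16, 8)] * 30
pages = pages_in_reading_order(spreads)
```

If the book is not centered on the table, splitting at the midpoint of the full image would cut through text, which is why the border crop comes before the split.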