Finding duplicate pages in pdf document
I have a pdf document that was created by concatenating a huge number of little documents. For example, 1.pdf, 2.pdf and 3.pdf. The problem is that the last page of 1.pdf is also the first page of 2.pdf, and the last page of 2.pdf is also the first ... you get the idea.
So, after joining, I got a pdf document with a lot of duplicate pages. And the document has about 12000 pages (!!). Is there a way to automatically detect duplicate pages and remove them?
Or any ideas on how to make this a little easier?
If your "identical" pages render into exactly the same visual appearance on screen, the following algorithmic approach could work to find out duplicates:
- Convert each page into a low-res TIFF or JPEG file using Ghostscript (f.e. using 72dpi).
- In case you use TIFF: run one of the libtiff commandline utilities to "normalize" the TIFF meta data.
- Run md5sum.exe on each TIFF or JPEG page and remember the Md5sum for each page.
- Sort the list of MD5sums to find the duplicate pages.
- Remember all duplicate page numbers to be deleted.
- Run a pdftk.exe command line on the original PDF to remove the duplicates.
You could code this algorithm in any language you like (even batch on Windows or bash on Linux/Unix/MacOSX).
First: Some notes on using Ghostscript. Create your 12000 TIFF (or JPEG) pages (on Linux you'd use gs instead of gswin32c):
gswin32c.exe ^
-dBATCH -dNOPAUSE -dSAFER ^
-sDEVICE=tiffg4 ^
-sOutputFile=C:\temp\tiffs\page-%06d.tif ^
-r72x72 ^
12000pages.pdf
# use -sDEVICE=jpeg to create *.jpeg files + adapt -sOutputFile= accordingly
# page-%06d.tif creates TIFFs named page-000001.tif through page-012000.tif
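If you would rather drive this step from a script than type the command by hand, a minimal Python sketch could look like this (my choice of language; it assumes gswin32c -- or gs on Linux -- is on your PATH and reuses the file names from the example above):

import subprocess

# Names taken from the example above -- adjust to your own input PDF and output directory.
INPUT_PDF = r"12000pages.pdf"
OUTPUT_PATTERN = r"C:\temp\tiffs\page-%06d.tif"

# The same Ghostscript call as above, just launched from Python
# (swap "gswin32c" for "gs" on Linux/Unix/MacOSX).
subprocess.run(
    ["gswin32c",
     "-dBATCH", "-dNOPAUSE", "-dSAFER",
     "-sDEVICE=tiffg4",
     "-sOutputFile=" + OUTPUT_PATTERN,
     "-r72x72",
     INPUT_PDF],
    check=True,
)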
Second: Some notes on the requirement of using the (freely available) libtiff utilities. When Ghostscript creates a TIFF page, it will note its current version, date and time plus some other metadata inside the TIFF. This could botch your MD5 checking, because otherwise identical TIFFs may carry different date/time stamps. Hence the need to "normalize" these. Use tiffinfo page-000001.tif or tiffdump page-000001.tif to see what I mean. You could see something like this:
c:\downloads> tiffdump.exe page-000001.tif
page-000001.tif:
Magic: 0x4949 <little-endian> Version: 0x2a
Directory 0: offset 2814 (0xafe) next 0 (0)
SubFileType (254) LONG (4) 1<2>
ImageWidth (256) SHORT (3) 1<595>
ImageLength (257) SHORT (3) 1<842>
BitsPerSample (258) SHORT (3) 1<1>
Compression (259) SHORT (3) 1<4>
Photometric (262) SHORT (3) 1<0>
FillOrder (266) SHORT (3) 1<1>
StripOffsets (273) LONG (4) 8<8 341 1979 1996 2013 2030 2047 2064>
Orientation (274) SHORT (3) 1<1>
SamplesPerPixel (277) SHORT (3) 1<1>
RowsPerStrip (278) SHORT (3) 1<109>
StripByteCounts (279) LONG (4) 8<333 1638 17 17 17 17 17 13>
XResolution (282) RATIONAL (5) 1<72>
YResolution (283) RATIONAL (5) 1<72>
PlanarConfig (284) SHORT (3) 1<1>
Group4Options (293) LONG (4) 1<0>
ResolutionUnit (296) SHORT (3) 1<2>
PageNumber (297) SHORT (3) 2<0 0>
Software (305) ASCII (2) 21<GPL Ghostscript 8.71\0>
DateTime (306) ASCII (2) 20<2010:06:22 04:56:12\0>
Here is the command to "normalize" the date+time fields (which are tagged "306" in my case) in an example TIFF:
c:\downloads> tiffset -s 306 "0000:00:00 00:00:00" ex001.tif
As a result, the DateTime field now has changed:
c:\pa>tiffdump ex001.tif | findstr DateTime
DateTime (306) ASCII (2) 20<0000:00:00 00:00:00\0>
Now loop through all your TIFFs to normalize all their DateTime fields:
c:\downloads> for %i in (C:\temp\tiffs\*.tif) ^
do tiffset -s 306 "0000:00:00 00:00:00" %i
Third and Fourth: Run md5sum.exe and sort the list of files to find duplicates. Here is a command line to use:
c:\downloads> md5sum.exe C:\temp\tiffs\*.tif | sort
As a result you should easily see which files/pages have the same MD5 hash. It will look similar to this:
c:\> md5sum.exe c:/temp/tiffs/page-0*.tif
[....]
fae9fa136c4f7ecca23b6a34d620fb02 *c:\temp\tiffs\page-000032.tif
fae9fa136c4f7ecca23b6a34d620fb02 *c:\temp\tiffs\page-000033.tif
fb5fef1732148d71bfff841c214cf836 *c:\temp\tiffs\page-000076.tif
fb5fef1732148d71bfff841c214cf836 *c:\temp\tiffs\page-000077.tif
fb86c1bdbc697eef7cb869f4e2e2957b *c:\temp\tiffs\page-000187.tif
fb86c1bdbc697eef7cb869f4e2e2957b *c:\temp\tiffs\page-000188.tif
fbb801ab3ef7ea33619132f97dcab045 *c:\temp\tiffs\page-000443.tif
fbb801ab3ef7ea33619132f97dcab045 *c:\temp\tiffs\page-000444.tif
fbc33cc0ff3e1252de1653ef2e978f94 *c:\temp\tiffs\page-000699.tif
fbc33cc0ff3e1252de1653ef2e978f94 *c:\temp\tiffs\page-000700.tif
fc3fd164e20bb707acddeabbc4e60f7e *c:\temp\tiffs\page-000899.tif
fc3fd164e20bb707acddeabbc4e60f7e *c:\temp\tiffs\page-000900.tif
[....]
I leave it to you to automate that step.
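For example, a rough Python sketch could look like the one below. Note that it computes the MD5 sums with Python's hashlib module instead of calling md5sum.exe, and it assumes the page-%06d.tif naming from the Ghostscript step above:

import glob
import hashlib
import os
import re

TIFF_DIR = r"C:\temp\tiffs"   # where the page-*.tif files were written

def md5_of_file(path):
    # Hash the raw file contents in chunks.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Map each digest to the list of page numbers that produced it.
pages_by_hash = {}
for path in sorted(glob.glob(os.path.join(TIFF_DIR, "page-*.tif"))):
    page_no = int(re.search(r"page-(\d+)", os.path.basename(path)).group(1))
    pages_by_hash.setdefault(md5_of_file(path), []).append(page_no)

# For every group of identical pages, keep the first one and mark the rest for deletion.
pages_to_delete = sorted(p for pages in pages_by_hash.values() for p in pages[1:])
print(pages_to_delete)   # e.g. [33, 77, 188, 444, 700, 900]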
Fifth and Sixth: Delete all duplicate pages from your original PDF. Assume you now want to delete pages 33, 77, 188, 444, 700 and 900. Here is the pdftk.exe command to achieve this:
c: > pdftk.exe A=12000pages.pdf ^
cat A1-32 A34-76 A78-187 A189-443 A445-699 A701-899 A901-end ^
output nonduplicates.pdf
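Typing that page-range list by hand for thousands of duplicates would be painful, so here is a small Python helper sketch (the function name and the handle letter "A" are just my choices) that turns a list of pages to delete into the A1-32 A34-76 ... range arguments for pdftk:

def pdftk_ranges(pages_to_delete, handle="A"):
    # Build the pdftk 'cat' ranges that skip the given 1-based page numbers.
    ranges = []
    start = 1
    for p in sorted(pages_to_delete):
        if p > start:
            ranges.append(f"{handle}{start}-{p - 1}")
        start = p + 1
    ranges.append(f"{handle}{start}-end")
    return ranges

print(" ".join(pdftk_ranges([33, 77, 188, 444, 700, 900])))
# prints: A1-32 A34-76 A78-187 A189-443 A445-699 A701-899 A901-end

Paste the printed ranges into the pdftk command above, or launch pdftk from the same script via subprocess.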
Edit: I don't know why I suggested TIFF at first -- it would be smarter to use BMP.
If you use -sDEVICE=bmp256 and -sOutputFile=C:\temp\tiffs\page-%06d.bmp, you will not have to deal with the 'normalisation' step I outlined above. The rest of the procedure (md5sum ...) is the same.
pdftk can split, combine and remove pages in PDF files, but I don't know of any function for finding duplicates.
You could split the document into individual pages and then, either using just the file size or by converting each page to plain text and using diff, find adjacent matching pages and delete them - then recombine into a single document.
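A rough Python sketch of the plain-text variant, assuming pdftotext (from poppler/xpdf) is installed and that the duplicate pages really are adjacent; the input file name and page count are placeholders:

import subprocess

PDF = "combined.pdf"    # your concatenated document
NUM_PAGES = 12000       # its total page count

def page_text(page):
    # Extract a single page as plain text with pdftotext ("-" means stdout).
    return subprocess.run(
        ["pdftotext", "-f", str(page), "-l", str(page), PDF, "-"],
        capture_output=True, text=True, check=True,
    ).stdout

# Compare each page with the previous one; identical text suggests a duplicate.
duplicates = []
previous = page_text(1)
for page in range(2, NUM_PAGES + 1):
    current = page_text(page)
    if current == previous:
        duplicates.append(page)
    previous = current

print(duplicates)   # feed these into pdftk's cat ranges to drop them

This spawns one pdftotext process per page, so it will be slow on 12000 pages, but it avoids rasterizing anything.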