Chop pages of a PDF into multiple pages [closed]

I've got a bunch of PDF files that contain two "real" pages to a single PDF page; I'd like to chop these in half and put each half on a separate page. Essentially, I need something that does the exact opposite of pdfnup (or psnup). How can this feat be achieved?

Platform is Linux, open source preferred. As I've got a great pile of these to do, something that can be scripted (as opposed to a GUI) would be nice, so I can just give it a list of files and have it chew away.

A pre-existing script isn't the only option, either; if there's sample code to manipulate PDFs in similar ways with a third-party library, I can probably hack it into doing what I want.

You can solve this with the help of Ghostscript. pdftk alone cannot do that (to the best of my knowledge). I'll give you the command-line steps to do this manually. It will be easy to script this as a procedure, also with different parameters for page sizes and page numbers. But you said that you can do that yourself ;-)

How to solve this with the help of Ghostscript...

...and for the fun of it, I've recently done it not with an input file featuring "double-up" pages, but one with "treble-ups". You can read the answer for this case here.

Your case is even simpler. You seem to have something similar to this:

+------------+------------+   ^
|            |            |   |
|      1     |      2     |   |
|            |            | 595 pt
|            |            |   |
|            |            |   |
|            |            |   |
+------------+------------+   v
             ^
            fold
             v
+------------+------------+   ^
|            |            |   |
|      3     |      4     |   |
|            |            | 595 pt
|            |            |   |
|            |            |   |
|            |            |   |
+------------+------------+   v
<---------- 842 pt -------->

You want to create 1 PDF with 4 pages, each of which has the size of 421 pt x 595 pt.

First Step

Let's first extract the left sections from each of the input pages:

gs \
    -o left-sections.pdf \
    -sDEVICE=pdfwrite \
    -g4210x5950 \
    -c "<</PageOffset [0 0]>> setpagedevice" \
    -f double-page-input.pdf

What did these parameters do?

First, know that in PDF 1 inch == 72 points. Then the rest is:

-o ...............: Names output file. Implicitly also sets -dBATCH -dNOPAUSE.
-sDEVICE=pdfwrite : we want PDF as output format.
-g................: sets output media size in pixels. pdfwrite's default resolution is 720 dpi. Hence multiply by 10 to get a match for PageOffset.
-c "..............: asks Ghostscript to process the given PostScript code snippet just before the main input file (which needs to follow with -f).
<</PageOffset ....: sets shifting of page image on the medium. (Of course, for left pages the shift by [0 0] has no real effect.)
-f ...............: process this input file.
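
The "multiply by 10" rule can be sketched as a small shell helper. This is only an illustration: the function name gs_geometry is made up here, and it assumes pdfwrite's default resolution of 720 dpi (10 pixels per point) and whole-number point sizes.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ghostscript): turn a target page size
# given in PDF points into the value expected by the -g switch.
# Assumes pdfwrite's default 720 dpi, i.e. 10 device pixels per point.
gs_geometry() {
    local width_pt=$1 height_pt=$2
    echo "$(( width_pt * 10 ))x$(( height_pt * 10 ))"
}

# Half of a landscape A4 sheet (842 pt x 595 pt), as in the example above
gs_geometry $(( 842 / 2 )) 595    # prints 4210x5950
```

The printed value can then be passed on as -g"$(gs_geometry 421 595)".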

Which result did the last command achieve?

This one:

Output file: left-sections.pdf, page 1
+------------+  ^
|            |  |
|     1      |  |
|            |595 pt
|            |  |
|            |  |
|            |  |
+------------+  v

Output file: left-sections.pdf, page 2
+------------+  ^
|            |  |
|     3      |  |
|            |595 pt
|            |  |
|            |  |
|            |  |
+------------+  v
<-- 421 pt -->

Second Step

Next, the right sections:

gs \
    -o right-sections.pdf \
    -sDEVICE=pdfwrite \
    -g4210x5950 \
    -c "<</PageOffset [-421 0]>> setpagedevice" \
    -f double-page-input.pdf

Note the negative offset since we are shifting the page to the left while keeping the viewing area stationary.

Result:

Output file: right-sections.pdf, page 1
+------------+  ^
|            |  |
|     2      |  |
|            |595 pt
|            |  |
|            |  |
|            |  |
+------------+  v

Output file: right-sections.pdf, page 2
+------------+  ^
|            |  |
|     4      |  |
|            |595 pt
|            |  |
|            |  |
|            |  |
+------------+  v
<-- 421 pt -->
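
The offsets used in the two steps follow one pattern: section number n (counting from 0 at the left) is shifted left by n times the section width. A minimal sketch of that arithmetic, where page_offset is a hypothetical name and the sheet width is assumed to divide evenly into whole points:

```shell
#!/bin/sh
# Hypothetical helper: print the PageOffset array for one vertical section
# of a sheet that holds several pages side by side.
#   $1 = section index, 0 for the leftmost section
#   $2 = full sheet width in points
#   $3 = number of sections per sheet
page_offset() {
    local column=$1 sheet_width_pt=$2 columns=$3
    local section_width=$(( sheet_width_pt / columns ))
    # Negative x offset: the page image moves left, the viewing area stays put
    echo "[$(( 0 - column * section_width )) 0]"
}

page_offset 0 842 2    # prints [0 0]
page_offset 1 842 2    # prints [-421 0]
```

The same helper would cover a "treble-up" sheet by passing 3 as the last argument.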

Last Step

Now we combine the pages into one file. We could do that with ghostscript as well, but we'll use pdftk instead, because it's faster for this job:

pdftk \
  A=left-sections.pdf \
  B=right-sections.pdf \
  shuffle A B \
  output single-pages-output.pdf \
  verbose

Done. Here is the desired result. 4 different pages, sized 421x595 pt.

Result:

+------------+ +------------+ +------------+ +------------+   ^
|            | |            | |            | |            |   |
|     1      | |     2      | |     3      | |     4      |   |
|            | |            | |            | |            | 595 pt
|            | |            | |            | |            |   |
|            | |            | |            | |            |   |
|            | |            | |            | |            |   |
+------------+ +------------+ +------------+ +------------+   v
<-- 421 pt --> <-- 421 pt --> <-- 421 pt --> <-- 421 pt -->
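
Since the question asks for something that can chew through a list of files, here is one way the three steps might be wrapped in a shell function. Treat it as a sketch under assumptions: every input is taken to be landscape A4 (842 pt x 595 pt) with two pages side by side, gs and pdftk are expected on the PATH, and the names run, split_two_up, DRY_RUN and the "-single.pdf" suffix are invented for this example.

```shell
#!/bin/sh
# Sketch only: split each two-up PDF named on the command line into single pages.
# Assumes landscape A4 input (842 pt x 595 pt); adjust -g and PageOffset otherwise.

# Run a command, or just print it when DRY_RUN is set (handy for checking).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

split_two_up() {
    local input output tmpdir
    for input in "$@"; do
        output="${input%.pdf}-single.pdf"
        tmpdir=$(mktemp -d) || return 1
        # Left halves, then right halves, then interleave them (left page first)
        run gs -o "$tmpdir/left.pdf" -sDEVICE=pdfwrite -g4210x5950 \
            -c "<</PageOffset [0 0]>> setpagedevice" -f "$input" &&
        run gs -o "$tmpdir/right.pdf" -sDEVICE=pdfwrite -g4210x5950 \
            -c "<</PageOffset [-421 0]>> setpagedevice" -f "$input" &&
        run pdftk A="$tmpdir/left.pdf" B="$tmpdir/right.pdf" \
            shuffle A B output "$output"
        rm -rf "$tmpdir"
    done
}

# Usage (not executed here): split_two_up *.pdf
```

Calling it as DRY_RUN=1 split_two_up some-file.pdf prints the gs and pdftk commands without touching any files, which is a cheap way to check the parameters before letting it loose on a whole directory.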

There is a tool pdfposter which can be used to create PDFs with several pages for one input page (tiling or chopping the pages). It is similar to the tool poster, which does the same for PostScript files.

So, after a lot more searching (it seems that "PDF cut pages" is a far better search), I found a little script called unpnup which uses poster, PDF/PS conversion, and pdftk to do exactly what I need. It's a bit of a long way around, but it's far superior to the other methods I found (such as using imagemagick) because it doesn't rasterise the pages before spitting them out.

Just in case mobileread goes away for some reason, the core of the script (licenced under the GPLv2 or later by Harald Hackenberg <hackenberggmx.at>) is as follows:

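pdftk "$1" burst
for file in pg*.pdf
do
    base=$(basename "$file" .pdf)
    # convert the burst page to EPS so that poster can process it
    pdftops -eps "$file"
    # tile the A4 page onto A5 media with a 0% cut margin
    poster -v -pA4 -mA5 -c0% "$base.eps" > "$base.tps"
    # convert the tiled PostScript back to PDF
    epstopdf "$base.tps"
done
pdftk pg*.pdf cat output "../$(basename "$1" .pdf)_unpnuped.pdf"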

I found the answer by Kurt Pfeifle to be very helpful for my similar situation. I thought I might share my modification of the solution with others...

I too had a scanned PDF that had 2 pages on each sheet. It was an 11 x 8.5 (inch) scan of a saddle-stitched booklet that was left stapled when originally scanned, so: PDF page 1 = back and front cover; PDF page 2 = pages 2 and 3, etc. This reads fine onscreen but you can't print it and then staple it to make more copies of the booklet.

I needed to be able to print this on a duplex copier; i.e. turn it BACK into an "imposed" PDF, ready for printing. So using Kurt's solution, I made this (ahem) "one-liner" to convert it back into half-pages, in the correct page order again. It will work for any HEIGHT and WIDTH, and also for any number of pages. In my case, I had a 40-page booklet (20 scanned pages in the PDF.)

HEIGHT=8.5 WIDTH=11 ORIG_FILE_PATH="original.pdf" \
count=$(set -xe; \
gs -o left.pdf -sDEVICE=pdfwrite \
-g$(perl -e "print(($WIDTH / 2) * 720)")x$(perl -e "print($HEIGHT * 720)") \
-c "<</PageOffset [0  0]>> setpagedevice" \
-f "$ORIG_FILE_PATH" >/dev/null; \
gs -o right.pdf -sDEVICE=pdfwrite \
-g$(perl -e "print(($WIDTH / 2) * 720)")x$(perl -e "print($HEIGHT * 720)") \
-c "<</PageOffset [-$(perl -e "print(($WIDTH / 2) * 72)")  0]>> setpagedevice" \
-f "$ORIG_FILE_PATH" | grep Page | wc -l ); \
echo '>>>>>' Re-ordering $count pages...; \
(set -xe; pdftk A=right.pdf B=left.pdf cat \
A1 `set +xe; for x in $(seq 2 $count); do echo B$x A$x; done` B1 \
output ordered.pdf); \
echo "Done. See ordered.pdf"

You only need to alter the first few parameters in this command to specify the HEIGHT and WIDTH and ORIG_FILE_PATH. The remainder of the command calculates the various sizes and calls gs twice, then pdftk. It will even count the pages in your scan and then produce the correct sort specification (for the scenario I gave).

It outputs some progress about what it's doing, which will look like this:

+++ perl -e 'print((11 / 2) * 720)'
+++ perl -e 'print(8.5 * 720)'
++ gs -o left.pdf -sDEVICE=pdfwrite -g3960x6120 -c '<</PageOffset [0  0]>> setpagedevice' -f original.pdf
++ wc -l
++ grep Page
+++ perl -e 'print((11 / 2) * 720)'
+++ perl -e 'print(8.5 * 720)'
+++ perl -e 'print((11 / 2) * 72)'
++ gs -o right.pdf -sDEVICE=pdfwrite -g3960x6120 -c '<</PageOffset [-396  0]>> setpagedevice' -f original.pdf
>>>>> Re-ordering 20 pages...
++ set +xe
+ pdftk A=right.pdf B=left.pdf cat A1 B2 A2 B3 A3 B4 A4 B5 A5 B6 A6 B7 A7 B8 A8 B9 A9 B10 A10 B11 A11 B12 A12 B13 A13 B14 A14 B15 A15 B16 A16 B17 A17 B18 A18 B19 A19 B20 A20 B1 output ordered.pdf
Done. See ordered.pdf
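
The page order passed to pdftk in the trace above can also be generated on its own. A minimal sketch, with booklet_order as a made-up function name; it reproduces the A1 B2 A2 ... B1 pattern for a scan whose first sheet holds the back cover on the left and the front cover on the right:

```shell
#!/bin/sh
# Hypothetical helper: build the pdftk page list for a scanned booklet.
#   A = right halves, B = left halves, $1 = number of scanned sheets.
# Sheet 1 is back cover (left) + front cover (right), so B1 goes last.
booklet_order() {
    local count=$1 n=2
    local spec="A1"
    while [ "$n" -le "$count" ]; do
        spec="$spec B$n A$n"
        n=$(( n + 1 ))
    done
    echo "$spec B1"
}

booklet_order 3    # prints A1 B2 A2 B3 A3 B1
```

The result could be dropped into the pdftk call, for example: pdftk A=right.pdf B=left.pdf cat $(booklet_order 20) output ordered.pdf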

Next, to get the page imposition you need for a printed booklet, you just "print" ordered.pdf on a custom page size of exactly the size you need (in my example, 5.5 x 8.5), sending it to a "booklet making" tool (in my case, I used Christoph Vogelbusch's Create Booklet for Mac from http://download.cnet.com/Create-Booklet/3000-2088_4-86349.html).

The resulting PDF will now be back to the original page size of 11 x 8.5 with 2 pages per sheet, but the ordering will be such that you can print it double-sided, short-edge binding, and voilà! you will have a printout you can photocopy and fold and saddle-stitch, reproducing the original booklet without ever disassembling (or even necessarily seeing) the original.

Hope this helps someone!

-c