How do I know if PDF pages are color or black-and-white?

This is one of the most interesting questions I've seen! I agree with some of the other posts that rendering to a bitmap and then analyzing the bitmap will be the most reliable solution. For simple PDFs, here's a faster but less complete approach.

  1. Parse each PDF page
  2. Look for color directives (g, rg, k, sc, scn, etc)
  3. Look for embedded images, analyze for color

My solution below does #1 and half of #2. The other half of #2 would be to follow up with user-defined color, which involves looking up the /ColorSpace entries in the page and decoding them -- contact me offline if this is interesting to you, as it's very doable but not in 5 minutes.

First the main program:

use CAM::PDF;

my $infile = shift;
my $pdf = CAM::PDF->new($infile);
PAGE:
for my $p (1 .. $pdf->numPages) {
   my $tree = $pdf->getPageContentTree($p);
   if (!$tree) {
      print "Failed to parse page $p\n";
      next PAGE;
   }
   my $colors = $tree->traverse('My::Renderer::FindColors')->{colors};
   my $uncertain = 0;
   for my $color (@{$colors}) {
      my ($name, @rest) = @{$color};
      if ($name eq 'g') {
      } elsif ($name eq 'rgb') {
         my ($r, $g, $b) = @rest;
         if ($r != $g || $r != $b) {
            print "Page $p is color\n";
            next PAGE;
         }
      } elsif ($name eq 'cmyk') {
         my ($c, $m, $y, $k) = @rest;
         if ($c != 0 || $m != 0 || $y != 0) {
            print "Page $p is color\n";
            next PAGE;
         }
      } else {
         $uncertain = $name;
      }
   }
   if ($uncertain) {
      print "Page $p has user-defined color ($uncertain), needs more investigation\n";
   } else {
      print "Page $p is grayscale\n";
   }
}

And then here's the helper renderer that handles color directives on each page:

package My::Renderer::FindColors;

sub new {
   my $pkg = shift;
   return bless { colors => [] }, $pkg;
}
sub clone {
   my $self = shift;
   my $pkg = ref $self;
   return bless { colors => $self->{colors}, cs => $self->{cs}, CS => $self->{CS} }, $pkg;
}
sub rg {
   my ($self, $r, $g, $b) = @_;
   push @{$self->{colors}}, ['rgb', $r, $g, $b];
}
sub g {
   my ($self, $gray) = @_;
   push @{$self->{colors}}, ['rgb', $gray, $gray, $gray];
}
sub k {
   my ($self, $c, $m, $y, $k) = @_;
   push @{$self->{colors}}, ['cmyk', $c, $m, $y, $k];
}
sub cs {
   my ($self, $name) = @_;
   $self->{cs} = $name;
}
sub cs {
   my ($self, $name) = @_;
   $self->{CS} = $name;
}
sub _sc {
   my ($self, $cs, @rest) = @_;
   return if !$cs; # syntax error                                                                                             
   if ($cs eq 'DeviceRGB') { $self->rg(@rest); }
   elsif ($cs eq 'DeviceGray') { $self->g(@rest); }
   elsif ($cs eq 'DeviceCMYK') { $self->k(@rest); }
   else { push @{$self->{colors}}, [$cs, @rest]; }
}
sub sc {
   my ($self, @rest) = @_;
   $self->_sc($self->{cs}, @rest);
}
sub SC {
   my ($self, @rest) = @_;
   $self->_sc($self->{CS}, @rest);
}
sub scn { sc(@_); }
sub SCN { SC(@_); }
sub RG { rg(@_); }
sub G { g(@_); }
sub K { k(@_); }

Newer versions of Ghostscript (version 9.05 and later) include a "device" called inkcov. It calculates the ink coverage of each page (not for each image) in Cyan (C), Magenta (M), Yellow (Y) and Black (K) values, where 0.00000 means 0%, and 1.00000 means 100% (see Detecting all pages which contain color).

For example:

$ gs -q -o - -sDEVICE=inkcov file.pdf 
0.11264  0.11605  0.11605  0.09364 CMYK OK
0.11260  0.11601  0.11601  0.09360 CMYK OK

If the CMY values are not 0 then the page is color.

To just output the pages that contain colors use this handy oneliner:

$ gs -o - -sDEVICE=inkcov file.pdf |tail -n +4 |sed '/^Page*/N;s/\n//'|sed -E '/Page [0-9]+ 0.00000  0.00000  0.00000  / d'

It is possible to use the Image Magick tool identify. If used on PDF pages it converts the page first to a raster image. If the page contained color can be tested using the -format "%[colorspace]" option, which for my PDF printed either Gray or RGB. IMHO identify (or what ever tool it uses in the background; Ghostscript?) does choose the colorspace depending on the presents of color.

An example is:

identify -format "%[colorspace]" $FILE.pdf[$PAGE]

where PAGE is the page starting from 0, not 1. If the page selection is not used all pages will be collapsed to one, which is not what you want.

I wrote the following BASH script which uses pdfinfo to get the number of pages and then loops over them. Outputting the pages which are in color. I also added a feature for double sided document where you might need a non-colored backside page as well.

Using the outputted space separated list the colored PDF pages can be extracted using pdftk:

pdftk $FILE cat $PAGELIST output color_${FILE}.pdf

#!/bin/bash

FILE=$1
PAGES=$(pdfinfo ${FILE} | grep 'Pages:' | sed 's/Pages:\s*//')

GRAYPAGES=""
COLORPAGES=""
DOUBLECOLORPAGES=""

echo "Pages: $PAGES"
N=1
while (test "$N" -le "$PAGES")
do
    COLORSPACE=$( identify -format "%[colorspace]" "$FILE[$((N-1))]" )
    echo "$N: $COLORSPACE"
    if [[ $COLORSPACE == "Gray" ]]
    then
        GRAYPAGES="$GRAYPAGES $N"
    else
        COLORPAGES="$COLORPAGES $N"
        # For double sided documents also list the page on the other side of the sheet:
        if [[ $((N%2)) -eq 1 ]]
        then
            DOUBLECOLORPAGES="$DOUBLECOLORPAGES $N $((N+1))"
            #N=$((N+1))
        else
            DOUBLECOLORPAGES="$DOUBLECOLORPAGES $((N-1)) $N"
        fi
    fi
    N=$((N+1))
done

echo $DOUBLECOLORPAGES
echo $COLORPAGES
echo $GRAYPAGES
#pdftk $FILE cat $COLORPAGES output color_${FILE}.pdf