How do I know if PDF pages are color or black-and-white?

This is one of the most interesting questions I've seen! I agree with some of the other posts that rendering to a bitmap and then analyzing the bitmap will be the most reliable solution. For simple PDFs, here's a faster but less complete approach.

1. Parse each PDF page.
2. Look for color directives (g, rg, k, sc, scn, etc.).
3. Look for embedded images and analyze them for color.

My solution below does #1 and half of #2. The other half of #2 would be to follow up with user-defined color, which involves looking up the /ColorSpace entries in the page and decoding them -- contact me offline if this is interesting to you, as it's very doable but not in 5 minutes.

First the main program:

use strict;
use warnings;
use CAM::PDF;

my $infile = shift;
my $pdf = CAM::PDF->new($infile);
PAGE:
for my $p (1 .. $pdf->numPages) {
   my $tree = $pdf->getPageContentTree($p);
   if (!$tree) {
      print "Failed to parse page $p\n";
      next PAGE;
   }
   my $colors = $tree->traverse('My::Renderer::FindColors')->{colors};
   my $uncertain = 0;
   for my $color (@{$colors}) {
      my ($name, @rest) = @{$color};
      if ($name eq 'g') {
         # Grayscale is never color; nothing to check.
      } elsif ($name eq 'rgb') {
         my ($r, $g, $b) = @rest;
         if ($r != $g || $r != $b) {
            print "Page $p is color\n";
            next PAGE;
         }
      } elsif ($name eq 'cmyk') {
         my ($c, $m, $y, $k) = @rest;
         if ($c != 0 || $m != 0 || $y != 0) {
            print "Page $p is color\n";
            next PAGE;
         }
      } else {
         $uncertain = $name;
      }
   }
   if ($uncertain) {
      print "Page $p has user-defined color ($uncertain), needs more investigation\n";
   } else {
      print "Page $p is grayscale\n";
   }
}

And then here's the helper renderer that handles color directives on each page:

package My::Renderer::FindColors;

sub new {
   my $pkg = shift;
   return bless { colors => [] }, $pkg;
}
sub clone {
   my $self = shift;
   my $pkg = ref $self;
   return bless { colors => $self->{colors}, cs => $self->{cs}, CS => $self->{CS} }, $pkg;
}
sub rg {
   my ($self, $r, $g, $b) = @_;
   push @{$self->{colors}}, ['rgb', $r, $g, $b];
}
sub g {
   my ($self, $gray) = @_;
   push @{$self->{colors}}, ['rgb', $gray, $gray, $gray];
}
sub k {
   my ($self, $c, $m, $y, $k) = @_;
   push @{$self->{colors}}, ['cmyk', $c, $m, $y, $k];
}
sub cs {
   my ($self, $name) = @_;
   $self->{cs} = $name;
}
# CS sets the stroking color space; cs (above) sets the non-stroking one.
sub CS {
   my ($self, $name) = @_;
   $self->{CS} = $name;
}
sub _sc {
   my ($self, $cs, @rest) = @_;
   return if !$cs; # syntax error
   if ($cs eq 'DeviceRGB') { $self->rg(@rest); }
   elsif ($cs eq 'DeviceGray') { $self->g(@rest); }
   elsif ($cs eq 'DeviceCMYK') { $self->k(@rest); }
   else { push @{$self->{colors}}, [$cs, @rest]; }
}
sub sc {
   my ($self, @rest) = @_;
   $self->_sc($self->{cs}, @rest);
}
sub SC {
   my ($self, @rest) = @_;
   $self->_sc($self->{CS}, @rest);
}
sub scn { sc(@_); }
sub SCN { SC(@_); }
sub RG { rg(@_); }
sub G { g(@_); }
sub K { k(@_); }

Newer versions of Ghostscript (version 9.05 and later) include a "device" called inkcov. It calculates the ink coverage of each page (not for each image) in Cyan (C), Magenta (M), Yellow (Y) and Black (K) values, where 0.00000 means 0%, and 1.00000 means 100% (see Detecting all pages which contain color).

For example:

$ gs -q -o - -sDEVICE=inkcov file.pdf 
0.11264  0.11605  0.11605  0.09364 CMYK OK
0.11260  0.11601  0.11601  0.09360 CMYK OK

If any of the C, M, or Y values is non-zero, the page contains color.
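
As a rough sketch of that rule, the helper below reads the quiet (-q) inkcov output shown above and prints the numbers of the pages whose C, M, or Y coverage is non-zero. The function name color_pages_from_inkcov is made up for this example and is not part of Ghostscript, and it assumes exactly one "C M Y K CMYK OK" line per page, in page order.

```shell
#!/bin/bash
# Hypothetical helper (not part of Ghostscript): reads the output of
#   gs -q -o - -sDEVICE=inkcov file.pdf
# on stdin and prints the 1-based numbers of pages that contain color.
# Assumption: one "C M Y K CMYK OK" line per page, in page order.
color_pages_from_inkcov() {
    awk '
        # Only count lines that look like inkcov results.
        $5 == "CMYK" {
            page++
            # Color means any of C, M, Y is greater than zero.
            if ($1 + 0 > 0 || $2 + 0 > 0 || $3 + 0 > 0) print page
        }
    '
}

# Usage (assumes Ghostscript 9.05 or later is installed):
#   gs -q -o - -sDEVICE=inkcov file.pdf | color_pages_from_inkcov
```

Counting result lines instead of parsing "Page N" headers is what lets this work with -q, which suppresses those headers.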

To output only the pages that contain color, use this handy one-liner:

$ gs -o - -sDEVICE=inkcov file.pdf |tail -n +4 |sed '/^Page*/N;s/\n//'|sed -E '/Page [0-9]+ 0.00000  0.00000  0.00000  / d'

You can use the ImageMagick tool identify. When run on a PDF page, it first converts the page to a raster image. Whether the page contains color can be tested with the -format "%[colorspace]" option, which for my PDF printed either Gray or RGB. IMHO identify (or whatever tool it uses in the background; Ghostscript?) chooses the colorspace depending on the presence of color.

An example is:

identify -format "%[colorspace]" $FILE.pdf[$PAGE]

where PAGE is the zero-based page index (the first page is 0, not 1). If the page selection is omitted, all pages are collapsed into one, which is not what you want.

I wrote the following Bash script, which uses pdfinfo to get the number of pages and then loops over them, printing the pages that are in color. I also added a feature for double-sided documents, where you might need the non-colored page on the other side of the sheet as well.

Using the space-separated list it prints, the colored PDF pages can be extracted with pdftk:

pdftk $FILE cat $PAGELIST output color_${FILE}.pdf

#!/bin/bash

FILE="$1"
# Quote the file name so paths containing spaces work.
PAGES=$(pdfinfo "$FILE" | grep 'Pages:' | sed 's/Pages:\s*//')

GRAYPAGES=""
COLORPAGES=""
DOUBLECOLORPAGES=""

echo "Pages: $PAGES"
N=1
while [ "$N" -le "$PAGES" ]
do
    COLORSPACE=$( identify -format "%[colorspace]" "$FILE[$((N-1))]" )
    echo "$N: $COLORSPACE"
    if [[ $COLORSPACE == "Gray" ]]
    then
        GRAYPAGES="$GRAYPAGES $N"
    else
        COLORPAGES="$COLORPAGES $N"
        # For double sided documents also list the page on the other side of the sheet:
        if [[ $((N%2)) -eq 1 ]]
        then
            DOUBLECOLORPAGES="$DOUBLECOLORPAGES $N $((N+1))"
            #N=$((N+1))
        else
            DOUBLECOLORPAGES="$DOUBLECOLORPAGES $((N-1)) $N"
        fi
    fi
    N=$((N+1))
done

echo $DOUBLECOLORPAGES
echo $COLORPAGES
echo $GRAYPAGES
#pdftk $FILE cat $COLORPAGES output color_${FILE}.pdf
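
One thing to watch in the double-sided list: when both sides of a sheet are in color, the loop above adds that sheet twice (for example "1 2 1 2"). A minimal sketch of one way to deduplicate it follows. The function name sheet_pages is made up for this example, and it assumes 1-based page numbers with odd pages on the front of each sheet. It does not clamp to the document's page count, so a colored last page of an odd-length document would still yield one page number too many.

```shell
#!/bin/bash
# Hypothetical helper: given 1-based color page numbers as arguments, print
# both pages of every sheet that has a color side, each page only once,
# in ascending order, separated by spaces.
sheet_pages() {
    local n first
    for n in "$@"; do
        # The front side of a sheet is the odd page number.
        first=$(( n % 2 == 1 ? n : n - 1 ))
        printf '%s\n%s\n' "$first" "$(( first + 1 ))"
    done | sort -n -u | paste -s -d ' ' -
}

# Usage with the script above:
#   DOUBLECOLORPAGES=$(sheet_pages $COLORPAGES)
```

$COLORPAGES is left unquoted in the usage line on purpose, so that the space-separated list is split into one argument per page.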
