PDF Compression with iTextSharp [closed]
I am currently trying to recompress a pdf that has already been created, I am trying to find a way to recompress the images that are in the document, to reduce the file size.
I have been trying to do this with the DataLogics PDE and iTextSharp libraries but I can not find a way to do the stream recompression of the items.
I have though about looping over the xobjects and getting the images and then dropping the DPI down to 96 or using the libjpeg C# implimentation to change the quality of the image but getting it back into the pdf stream seems to always end up, with memory corruption or some other issue.
Any samples will be appreciated.
Thanks
iText and iTextSharp have some methods for replacing indirect objects. Specifically there's PdfReader.KillIndirect()
which does what it says and PdfWriter.AddDirectImageSimple(iTextSharp.text.Image, PRIndirectReference)
which you can then use to replace what you killed off.
In pseudo C# code you'd do:
var oldImage = PdfReader.GetPdfObject();
var newImage = YourImageCompressionFunction(oldImage);
PdfReader.KillIndirect(oldImage);
yourPdfWriter.AddDirectImageSimple(newImage, (PRIndirectReference)oldImage);
Converting the raw bytes to a .Net image can be tricky, I'll leave that up to you or you can search here. Mark has a good description here. Also, technically PDFs don't have a concept of DPI, that's for printers mostly. See the answer here for more on that.
Using the method above your compression algorithm can actually do two things, physically shrink the image as well as apply JPEG compression. When you physically shrink the image and add it back it will occupy the same amount of space as the original image but with less pixels to work with. This will get you what you consider to be DPI reduction. The JPEG compression speaks for itself.
Below is a full working C# 2010 WinForms app targeting iTextSharp 5.1.1.0. It takes an existing JPEG on your desktop called "LargeImage.jpg" and creates a new PDF from it. Then it opens the PDF, extracts the image, physically shrinks it to 90% of the original size, applies 85% JPEG compression and writes it back to the PDF. See the comments in the code for more of an explanation. The code needs lots more null/error checking. Also looks for NOTE
comments where you'll need to expand to handle other situations.
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Drawing.Drawing2D;
using System.Windows.Forms;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
namespace WindowsFormsApplication1 {
public partial class Form1 : Form {
public Form1() {
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e) {
//Our working folder
string workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
//Large image to add to sample PDF
string largeImage = Path.Combine(workingFolder, "LargeImage.jpg");
//Name of large PDF to create
string largePDF = Path.Combine(workingFolder, "Large.pdf");
//Name of compressed PDF to create
string smallPDF = Path.Combine(workingFolder, "Small.pdf");
//Create a sample PDF containing our large image, for demo purposes only, nothing special here
using (FileStream fs = new FileStream(largePDF, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (Document doc = new Document()) {
using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)) {
doc.Open();
iTextSharp.text.Image importImage = iTextSharp.text.Image.GetInstance(largeImage);
doc.SetPageSize(new iTextSharp.text.Rectangle(0, 0, importImage.Width, importImage.Height));
doc.SetMargins(0, 0, 0, 0);
doc.NewPage();
doc.Add(importImage);
doc.Close();
}
}
}
//Now we're going to open the above PDF and compress things
//Bind a reader to our large PDF
PdfReader reader = new PdfReader(largePDF);
//Create our output PDF
using (FileStream fs = new FileStream(smallPDF, FileMode.Create, FileAccess.Write, FileShare.None)) {
//Bind a stamper to the file and our reader
using (PdfStamper stamper = new PdfStamper(reader, fs)) {
//NOTE: This code only deals with page 1, you'd want to loop more for your code
//Get page 1
PdfDictionary page = reader.GetPageN(1);
//Get the xobject structure
PdfDictionary resources = (PdfDictionary)PdfReader.GetPdfObject(page.Get(PdfName.RESOURCES));
PdfDictionary xobject = (PdfDictionary)PdfReader.GetPdfObject(resources.Get(PdfName.XOBJECT));
if (xobject != null) {
PdfObject obj;
//Loop through each key
foreach (PdfName name in xobject.Keys) {
obj = xobject.Get(name);
if (obj.IsIndirect()) {
//Get the current key as a PDF object
PdfDictionary imgObject = (PdfDictionary)PdfReader.GetPdfObject(obj);
//See if its an image
if (imgObject.Get(PdfName.SUBTYPE).Equals(PdfName.IMAGE)) {
//NOTE: There's a bunch of different types of filters, I'm only handing the simplest one here which is basically raw JPG, you'll have to research others
if (imgObject.Get(PdfName.FILTER).Equals(PdfName.DCTDECODE)) {
//Get the raw bytes of the current image
byte[] oldBytes = PdfReader.GetStreamBytesRaw((PRStream)imgObject);
//Will hold bytes of the compressed image later
byte[] newBytes;
//Wrap a stream around our original image
using (MemoryStream sourceMS = new MemoryStream(oldBytes)) {
//Convert the bytes into a .Net image
using (System.Drawing.Image oldImage = Bitmap.FromStream(sourceMS)) {
//Shrink the image to 90% of the original
using (System.Drawing.Image newImage = ShrinkImage(oldImage, 0.9f)) {
//Convert the image to bytes using JPG at 85%
newBytes = ConvertImageToBytes(newImage, 85);
}
}
}
//Create a new iTextSharp image from our bytes
iTextSharp.text.Image compressedImage = iTextSharp.text.Image.GetInstance(newBytes);
//Kill off the old image
PdfReader.KillIndirect(obj);
//Add our image in its place
stamper.Writer.AddDirectImageSimple(compressedImage, (PRIndirectReference)obj);
}
}
}
}
}
}
}
this.Close();
}
//Standard image save code from MSDN, returns a byte array
private static byte[] ConvertImageToBytes(System.Drawing.Image image, long compressionLevel) {
if (compressionLevel < 0) {
compressionLevel = 0;
} else if (compressionLevel > 100) {
compressionLevel = 100;
}
ImageCodecInfo jgpEncoder = GetEncoder(ImageFormat.Jpeg);
System.Drawing.Imaging.Encoder myEncoder = System.Drawing.Imaging.Encoder.Quality;
EncoderParameters myEncoderParameters = new EncoderParameters(1);
EncoderParameter myEncoderParameter = new EncoderParameter(myEncoder, compressionLevel);
myEncoderParameters.Param[0] = myEncoderParameter;
using (MemoryStream ms = new MemoryStream()) {
image.Save(ms, jgpEncoder, myEncoderParameters);
return ms.ToArray();
}
}
//standard code from MSDN
private static ImageCodecInfo GetEncoder(ImageFormat format) {
ImageCodecInfo[] codecs = ImageCodecInfo.GetImageDecoders();
foreach (ImageCodecInfo codec in codecs) {
if (codec.FormatID == format.Guid) {
return codec;
}
}
return null;
}
//Standard high quality thumbnail generation from http://weblogs.asp.net/gunnarpeipman/archive/2009/04/02/resizing-images-without-loss-of-quality.aspx
private static System.Drawing.Image ShrinkImage(System.Drawing.Image sourceImage, float scaleFactor) {
int newWidth = Convert.ToInt32(sourceImage.Width * scaleFactor);
int newHeight = Convert.ToInt32(sourceImage.Height * scaleFactor);
var thumbnailBitmap = new Bitmap(newWidth, newHeight);
using (Graphics g = Graphics.FromImage(thumbnailBitmap)) {
g.CompositingQuality = CompositingQuality.HighQuality;
g.SmoothingMode = SmoothingMode.HighQuality;
g.InterpolationMode = InterpolationMode.HighQualityBicubic;
System.Drawing.Rectangle imageRectangle = new System.Drawing.Rectangle(0, 0, newWidth, newHeight);
g.DrawImage(sourceImage, imageRectangle);
}
return thumbnailBitmap;
}
}
}
I don't know about iTextSharp, but you have to rewrite a PDF file if anything is changed, as it contains an xref table (index) with the exact file position of each object. This means if even one byte is added or removed, the PDF becomes corrupted.
Your best bet for recompressing the images is JBIG2 if they are B&W, or JPEG2000 otherwise, for which Jasper library will happily encode JPEG2000 codestreams for placement into PDF files at whatever quality you so desire.
If it were me I'd do it all from code without the PDF libraries. Just find all images (anything between stream
and endstream
after an occurance of JPXDecode
(JPEG2000), JBIG2Decode
(JBIG2) or DCTDecode
(JPEG)) pull that out, reencode it with Jasper, then stick it back in again and update the xref table.
To update the xref table, find the positions of each object (starting 00001 0 obj
) and just update the new positions in the xref table. It's not too much work, less than it sounds. You might be able to get all the offsets with a single regular expression (I'm not a C# programmer, but in PHP it would be that simple.)
Then finally update the value of the startxref
tag in the trailer
with the offset of the beginning of the xref table (where it says xref
in the file).
Otherwise you'll end up decoding the entire PDF and rewriting it all, which will be slow, and you might lose something along the way.