How to remove duplicate fonts in a PDF document with 150,000 embedded fonts?

Type 3 fonts are incredibly rare in PDF files. Type 3 fonts are really PDF-native fonts, in that the glyphs are described with PDF page graphics operators. So you would never encounter a PDF Type 3 font as a standalone font file, since only a PDF renderer would know what to do with it.

  1. The glyphs are defined in the object referenced by the /CharProcs key, so object 239 in your last example. The /FontBBox is normally just used for text selection, so you could probably just take the union of all the FontBBoxes.

  2. You could compare the graphics operators, or even just hash the glyph streams, to find matches. Then perhaps you could synthesize new fonts from the collected glyphs. However, you also need to check the encodings. If each font is encoded differently, with different character codes mapping to different glyphs, then you need to go back and also rewrite the page content streams to use the new character codes. Finally, you probably want to keep the ToUnicode mappings correct (if you want to preserve text selection/extraction), which means also tracking the character-code-to-Unicode mappings and generating new ToUnicode CMaps.
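To make the two steps above concrete, here is a minimal sketch of the matching and re-encoding logic only. It assumes you have already used some PDF library to pull each font into a plain dict; the dict layout (`char_procs`, `encoding`, `bbox`) and the function names are hypothetical, made up for illustration, and not any library's API. It also ignores /Widths, /FontMatrix, /Resources and ToUnicode, which would all need to agree before two glyphs can really be treated as the same.

```python
import hashlib

# Hypothetical in-memory model of one Type 3 font (not a real library type):
#   "char_procs": glyph name -> decoded glyph content stream (bytes)
#   "encoding":   character code -> glyph name
#   "bbox":       (llx, lly, urx, ury) taken from /FontBBox

def glyph_key(stream: bytes) -> str:
    """Hash a glyph stream; byte-identical streams get the same key."""
    return hashlib.sha256(stream).hexdigest()

def union_bbox(boxes):
    """Smallest box that contains every (llx, lly, urx, ury) box given."""
    boxes = list(boxes)
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )

def merge_fonts(fonts):
    """Collect unique glyphs across fonts and build per-font code remaps.

    Returns (merged, remaps):
      merged    maps new character code -> glyph stream
      remaps[i] maps old character code -> new character code for fonts[i]
    """
    code_for_hash = {}
    merged = {}
    remaps = []
    for font in fonts:
        remap = {}
        for old_code, glyph_name in font["encoding"].items():
            stream = font["char_procs"][glyph_name]
            key = glyph_key(stream)
            if key not in code_for_hash:
                new_code = len(code_for_hash)
                # Type 3 fonts are simple fonts with single-byte codes, so a
                # real tool would start another merged font here instead.
                if new_code > 255:
                    raise ValueError("more than 256 unique glyphs")
                code_for_hash[key] = new_code
                merged[new_code] = stream
            remap[old_code] = code_for_hash[key]
        remaps.append(remap)
    return merged, remaps
```

Each entry in `remaps` is what you would then use to rewrite the string operands of the text-showing operators on every page that uses that font, and `union_bbox` gives a /FontBBox for the merged font. Note that hashing only catches byte-identical streams; glyphs that differ in whitespace or number formatting would need the operators to be parsed and normalized first.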

In short, repairing as a post-processing step is non-trivial.

It is typically much better and easier to go back and deal with the root of the issue: fix how the PDF files are created and merged, so the duplicate fonts never get written in the first place.