Search ﬁ (one character) as fi (two characters) in mdfind

Solution 1:

Although the fi characters are displayed as a single ligature glyph, they are understood within the PDF as distinct letters. The same holds in every other text app, such as TextEdit, Pages, and Safari: they display ligatures but still treat them as separate characters.

I can search in Safari or Preview within the PDF for the letters fi, and get the ligature in the results:

[Screenshot: Safari find]

I can also copy and paste the text, or export it from the PDF, and the text has separate characters for that ligature.

However, results using Spotlight do seem to be more variable. If I create a PDF from TextEdit with the word 'office' using ligature glyphs, that word is not found in a Spotlight search. If I do the same from Affinity Publisher, the word is found.

I have other PDFs with ligature glyphs that Spotlight can search.

It is of course also possible to produce a PDF where the underlying characters are not preserved, so that the text layer holds only the single ligature character.
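One way to check a given PDF for this is to extract its text and look for the precomposed Unicode ligature code points (U+FB00 to U+FB06). The following is only a sketch: the function name find_ligatures is made up for illustration, and it reads text on standard input so it does not depend on any particular extractor.

```sh
#!/bin/sh
# Sketch: list the lines of extracted text that contain a precomposed
# Unicode ligature character (U+FB00..U+FB06).

# UTF-8 bytes of each code point, written as octal escapes.
LIG_FF=$(printf '\357\254\200')       # U+FB00 ff
LIG_FI=$(printf '\357\254\201')       # U+FB01 fi
LIG_FL=$(printf '\357\254\202')       # U+FB02 fl
LIG_FFI=$(printf '\357\254\203')      # U+FB03 ffi
LIG_FFL=$(printf '\357\254\204')      # U+FB04 ffl
LIG_LONG_ST=$(printf '\357\254\205')  # U+FB05 long s + t
LIG_ST=$(printf '\357\254\206')       # U+FB06 st

# find_ligatures: read text on stdin, print "LINE:TEXT" for each match.
# (Hypothetical helper name.) LC_ALL=C plus -F makes grep compare raw bytes.
find_ligatures() {
  LC_ALL=C grep -n -F \
    -e "$LIG_FF" -e "$LIG_FI" -e "$LIG_FL" -e "$LIG_FFI" \
    -e "$LIG_FFL" -e "$LIG_LONG_ST" -e "$LIG_ST"
}
```

Assuming Poppler's pdftotext is installed (it is not part of macOS), something like `pdftotext some.pdf - | find_ligatures` would show whether the extracted text carries ligature characters; some.pdf is a placeholder. An extractor may expand or keep ligatures on its own, so an empty result is a hint rather than proof.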

TL;DR: it seems that Spotlight is choosy about font encoding when indexing PDF text content. Text encoded with a Type 1 Roman encoding does not produce the correct result.

So your options are to write a shell script that substitutes the Unicode ligature characters wherever the relevant combinations of letters occur (fi, fl, ffi, ffl, ct, st) and searches PDFs using both forms, or to use a non-Spotlight method of querying the text in the PDF.
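A minimal sketch of the first option, under a few assumptions: the function names ligate and mdfind_both are invented here, only the fully ligated spelling is generated, and "ct" is skipped because Unicode has no precomposed ct ligature (those are normally font-level glyph substitutions). Whether Spotlight actually matches the ligated form will depend on how the PDF was indexed.

```sh
#!/bin/sh
# Sketch only: search Spotlight for a word in both plain and ligated form.

# UTF-8 bytes of the Unicode ligature characters, written as octal escapes.
LIG_FFI=$(printf '\357\254\203')  # U+FB03 ffi
LIG_FFL=$(printf '\357\254\204')  # U+FB04 ffl
LIG_FI=$(printf '\357\254\201')   # U+FB01 fi
LIG_FL=$(printf '\357\254\202')   # U+FB02 fl
LIG_ST=$(printf '\357\254\206')   # U+FB06 st

# ligate WORD: print WORD with each letter sequence replaced by its ligature.
# (Hypothetical helper name.)
ligate() {
  # Longest sequences go first so "ffi" is not split into "f" + "fi".
  printf '%s\n' "$1" | sed \
    -e "s/ffi/$LIG_FFI/g" \
    -e "s/ffl/$LIG_FFL/g" \
    -e "s/fi/$LIG_FI/g" \
    -e "s/fl/$LIG_FL/g" \
    -e "s/st/$LIG_ST/g"
}

# mdfind_both WORD: query Spotlight with both spellings and merge the hits.
# (Hypothetical helper name; requires macOS, where mdfind is available.)
mdfind_both() {
  ligated=$(ligate "$1")
  {
    mdfind "$1"
    # Skip the second query when the word contains no ligature sequence.
    if [ "$ligated" != "$1" ]; then
      mdfind "$ligated"
    fi
  } | sort -u
}
```

After sourcing the file, `mdfind_both office` would run one query for "office" and one for the version with the ffi ligature, then print each matching path once. A PDF that ligates only some sequences in a word would need further variants than this sketch produces.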