Full Text Index not reading PDFs on Domino 9 Server
We have an IBM Notes procedure database that uses a separate database to store attachment documents, each of which has the current copy of the latest procedure attached. That database is full-text indexed for searching the procedures. Most of the procedures are Word documents and don't seem to have a problem, but one particular kind of procedure is stored as a PDF. The problem we have is with the PDFs. A search returns only Word documents that contain the search phrase, even though there are many PDFs that contain it as well. Is there a setting or something that needs to be set to get it to find the PDFs? These are true PDFs, not TIFs. MJ
Solution 1:
Unfortunately you can't use the answer from Torsten. Domino started using Apache Tika from version 10.0 onward, and Domino 9.x and prior all used the Verity Keyview filter libraries. Was there ever a point at which the PDFs were indexing?
One thing I might try in order to troubleshoot this is to enable the INI setting DEBUG_FT_STREAM=2049. You don't need to restart the server. Rebuild your database's index (load updall -x mydbname). If the PDF is being processed at all, you should see a log line like one of the following:
"Indexing Attachment Object: 'myattachment.pdf' Size = 65536 using Keyview"
"Indexing Attachment Object: 'myattachment.pdf' Size = 65536 using Brute Force"
If neither of these shows up, the PDF is not being processed at all and you may need to dig some more. If the "Brute Force" one shows up then, yeah, something from the PDF is being indexed, but who knows what. Brute Force just quickly strips out any ASCII text it can find, so the indexed result can be very inaccurate.
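If the database holds a lot of attachments, checking those log lines by eye gets tedious. Here is a rough Python sketch that sorts the PDF attachments by which filter handled them. It assumes you have copied the console or log output into a text string, and that the lines look exactly like the two examples above; the function names are made up for illustration, and attachment names containing an apostrophe would not match this pattern.

```python
import re

# Pattern based only on the two example log lines quoted above.
# The exact format on a given server is an assumption.
LINE_RE = re.compile(
    r"Indexing Attachment Object: '(?P<name>[^']+)' "
    r"Size = (?P<size>\d+) using (?P<method>Keyview|Brute Force)"
)


def classify_attachments(log_text):
    """Map each attachment name found in the log text to its filter."""
    results = {}
    for match in LINE_RE.finditer(log_text):
        # If an attachment is logged more than once, the last line wins.
        results[match.group("name")] = match.group("method")
    return results


def pdfs_by_method(log_text):
    """Return (keyview_pdfs, brute_force_pdfs) as two sorted lists."""
    keyview, brute_force = [], []
    for name, method in classify_attachments(log_text).items():
        if not name.lower().endswith(".pdf"):
            continue  # only PDFs are of interest here
        if method == "Keyview":
            keyview.append(name)
        else:
            brute_force.append(name)
    return sorted(keyview), sorted(brute_force)
```

Read the saved log into a string and pass it to pdfs_by_method(). Any PDF you know is in the database but which appears in neither list was not logged as processed at all, and anything in the second list was only indexed by Brute Force.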