Extracting text from PDFs in C# [closed]
Solution 1:
Doing this reliably can be difficult. The problem is that PDF is a presentation format that cares about good typography, not about the logical structure of the text. Suppose you just wanted to output a single word: Tap.
A PDF rendering engine might output this as two separate calls, as shown in this pseudo-code:
moveto (x1, y); output ("T")
moveto (x2, y); output ("ap")
The engine does this because the default kerning (inter-letter spacing) between the letters T and a might not be acceptable to it, or because it is adding or removing a tiny amount of space between characters to get a fully justified line. The upshot is that the text fragments actually stored in a PDF are very often not full words, but pieces of them.
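To make the consequence concrete, here is a minimal sketch of how an extractor might stitch such fragments back together. It assumes you already have positioned fragments from some parser; the TextFragment type, the MergeIntoLines helper, and both threshold values are hypothetical names and numbers chosen for illustration, not part of any particular library. Real extractors also have to account for font metrics, rotation, and multi-column layouts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical fragment type: one positioned run of text as a parser might report it.
// X/Y is the start of the run on its baseline; Width is its rendered width.
public record TextFragment(double X, double Y, double Width, string Text);

public static class FragmentMerger
{
    // Groups fragments into lines by baseline, then joins each line left to right.
    // lineTolerance and gapThreshold are assumed, tunable values in page units.
    public static List<string> MergeIntoLines(
        IEnumerable<TextFragment> fragments,
        double lineTolerance = 2.0,
        double gapThreshold = 1.5)
    {
        var lines = new List<string>();
        var current = new List<TextFragment>();

        // PDF y coordinates grow upward, so descending Y reads top to bottom.
        foreach (var fragment in fragments.OrderByDescending(f => f.Y))
        {
            // A baseline far from the current line's first fragment starts a new line.
            if (current.Count > 0 && Math.Abs(current[0].Y - fragment.Y) > lineTolerance)
            {
                lines.Add(JoinLine(current, gapThreshold));
                current = new List<TextFragment>();
            }
            current.Add(fragment);
        }

        if (current.Count > 0)
        {
            lines.Add(JoinLine(current, gapThreshold));
        }
        return lines;
    }

    // Concatenates one line's fragments, adding a space only where the horizontal
    // gap between neighbours is wider than gapThreshold.
    private static string JoinLine(List<TextFragment> line, double gapThreshold)
    {
        var ordered = line.OrderBy(f => f.X).ToList();
        var builder = new StringBuilder(ordered[0].Text);
        for (int i = 1; i < ordered.Count; i++)
        {
            double previousEnd = ordered[i - 1].X + ordered[i - 1].Width;
            if (ordered[i].X - previousEnd > gapThreshold)
            {
                builder.Append(' ');
            }
            builder.Append(ordered[i].Text);
        }
        return builder.ToString();
    }
}
```

With this sketch, a fragment "T" at x = 10 (width 5.5) and a fragment "ap" at x = 15 on the same baseline overlap slightly because of kerning, so they are joined without a space and come out as "Tap". The gap threshold is the fragile part: set it too low and words get split, too high and separate words run together.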
Solution 2:
Take a look at Tika on DotNet, available through NuGet: https://www.nuget.org/packages/TikaOnDotnet.TextExtractor/
This is a wrapper around the very capable Apache Tika Java library, using IKVM. It is easy to use and handles a wide variety of file types besides PDF, including old and new Office formats. It auto-selects the parser for the file type it detects, so it's as easy as:
var text = new TextExtractor().Extract(file.FullName).Text;
Update: One caution with this solution is that development on IKVM has ended. I'm not sure what this will mean in the long run. http://weblog.ikvm.net/2017/04/21/TheEndOfIKVMNET.aspx
Solution 3:
If you are processing PDF files in order to import data into a database, consider ByteScout PDF Extractor SDK. Some useful features are:
- table detection;
- text extraction as CSV, XML or formatted text (with optional layout restoration);
- text search with support for regular expressions;
- a low-level API to access text objects.
DISCLAIMER: I'm affiliated with ByteScout
Solution 4:
You can try Toxy, a text/data extraction framework for .NET. It supports .NET Standard 2.0. For details, see https://github.com/nissl-lab/toxy