Extracting text from PDFs in C# [closed]

Solution 1:

There may be some difficulty in doing this reliably. The problem is that PDF is a presentation format which attaches importance to good typography. Suppose you just wanted to output a single word: Tap.

A PDF rendering engine might output this as 2 separate calls, as shown in this pseudo-code:

moveto (x1, y); output ("T")
moveto (x2, y); output ("ap")

This would be done because the default kerning (inter-letter spacing) between the letters T and a might not be acceptable to the rendering engine, or it might be adding or removing some micro space between characters to get a fully justified line. What this finally results in is that the actual text fragments found in PDF are very often not full words, but pieces of them.

Solution 2:

Take a look at Tika on DotNet, available through Nuget: https://www.nuget.org/packages/TikaOnDotnet.TextExtractor/

This is a wrapper around the extremely good Tika java library, using IKVM. Very easy to use and handles a wide variety of file types other than PDF, including old and new office formats. It will auto-select the parser based on the file extension, so it's as easy as:

var text = new TextExtractor().Extract(file.FullName).Text;

Update: One caution with this solution is that development on IKVM has ended. I'm not sure what this will mean in the long run. http://weblog.ikvm.net/2017/04/21/TheEndOfIKVMNET.aspx

Solution 3:

In case you are processing PDF files with the purpose of importing data into a database then I suggest to consider ByteScout PDF Extractor SDK. Some useful functions included are

  • table detection;
  • text extraction as CSV, XML or formatted text (with the optional layout restoration);
  • text search with support for regular expressions;
  • low-level API to access text objects

DISCLAIMER: I'm affiliated with ByteScout

Solution 4:

You can try Toxy, a text/data extraction framework in .NET. It supports .NET standard 2.0. For detail, please visit https://github.com/nissl-lab/toxy