Retrieve the respective coordinates of all words on the page with itextsharp
(I mostly work with the Java library iText, not with the .NET library iTextSharp, so please excuse some Java-isms here; everything should be easy to translate.)
To extract the contents of a page using iText(Sharp), you use the classes in the parser package, which feed the page content, after some preprocessing, to a RenderListener of your choice.
In a context in which you are only interested in the text, you most commonly use a TextExtractionStrategy, which is derived from RenderListener and adds a single method, getResultantText, to retrieve the aggregated text from the page.
As the initial intent of text parsing in iText was to implement this use case, most existing RenderListener samples are TextExtractionStrategy implementations and only make the text available.
Therefore, you will have to implement your own RenderListener, which you already seem to have christened TextWithPositionExtractionStategy.
Just as there is both a SimpleTextExtractionStrategy (which is implemented with some assumptions about the structure of the page content operators) and a LocationTextExtractionStrategy (which does not make the same assumptions but is somewhat more complicated), you might want to start with an implementation that makes some assumptions.
Thus, just like the SimpleTextExtractionStrategy, your first, simple implementation expects the text rendering events forwarded to your listener to arrive line by line, and within a line from left to right. This way, as soon as you find a horizontal gap or a punctuation character, you know your current word is finished and you can process it.
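The two checks this relies on (did a new line start, and is there a horizontal gap) can be sketched without any iText types. The following is only a sketch under assumptions: the class LineChecks, its method names and both thresholds (1 unit for the line test, half a space width for the gap test) are made up for illustration, and plain float coordinates stand in for the baseline start and end points you would read from the render info.

```java
/** Illustrative helper, not part of iText: line and gap checks on plain coordinates. */
final class LineChecks {
    /** Assumed tolerance (in user space units) for "still on the same line". */
    static final float SAME_LINE_TOLERANCE = 1f;

    /**
     * True if the start point of the new chunk lies farther than the tolerance
     * from the line through the previous chunk's baseline (lastStart -> lastEnd).
     */
    static boolean isNewLine(float lastStartX, float lastStartY,
                             float lastEndX, float lastEndY,
                             float startX, float startY) {
        float dx = lastEndX - lastStartX;
        float dy = lastEndY - lastStartY;
        float lengthSquared = dx * dx + dy * dy;
        float limit = SAME_LINE_TOLERANCE * SAME_LINE_TOLERANCE;
        if (lengthSquared == 0f) {
            // Degenerate baseline (e.g. empty chunk): compare the points directly.
            float ex = startX - lastStartX;
            float ey = startY - lastStartY;
            return ex * ex + ey * ey > limit;
        }
        // Squared point-to-line distance via the 2D cross product.
        float cross = dx * (lastStartY - startY) - dy * (lastStartX - startX);
        return cross * cross / lengthSquared > limit;
    }

    /** True if the distance from the previous end to the new start exceeds half a space width. */
    static boolean isGapFromPrevious(float lastEndX, float lastEndY,
                                     float startX, float startY,
                                     float singleSpaceWidth) {
        float gx = startX - lastEndX;
        float gy = startY - lastEndY;
        float spacing = (float) Math.sqrt(gx * gx + gy * gy);
        return spacing > singleSpaceWidth / 2f;
    }
}
```

In a real listener you would take the coordinates from the baseline of each TextRenderInfo and remember the previous chunk's start and end between calls.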
In contrast to the text extraction strategies, you don't need a StringBuffer member to collect your result but instead a list of some "word with position" structure. Furthermore, you need a member variable to hold the TextRenderInfo events you have already collected for this page but could not yet finally process (a word may arrive in several separate events).
As soon as you (i.e. your renderText method) are called for a new TextRenderInfo object, you should operate like this (pseudo-code):
    if (unprocessedTextRenderInfos not empty)
    {
        if (isNewLine // Check this like the simple text extraction strategy checks for hardReturn
         || isGapFromPrevious) // Check this like the simple text extraction strategy checks whether to insert a space
        {
            process(unprocessedTextRenderInfos);
            unprocessedTextRenderInfos.clear();
        }
    }
    split new TextRenderInfo using its getCharacterRenderInfos() method;
    while (characterRenderInfos contain word end)
    {
        add characterRenderInfos up to excluding the white space/punctuation to unprocessedTextRenderInfos;
        process(unprocessedTextRenderInfos);
        unprocessedTextRenderInfos.clear();
        remove used render infos from characterRenderInfos;
    }
    add remaining characterRenderInfos to unprocessedTextRenderInfos;
In process(unprocessedTextRenderInfos) you extract the information you need from the unprocessedTextRenderInfos: you concatenate the individual text contents to form a word and take the coordinates you want. If you merely want the starting coordinates, you take those from the first of the unprocessed TextRenderInfos; if you need more data, you also use the data from the other TextRenderInfos. With this data you fill a "word with position" structure and add it to your result list.
When page processing is finished, you have to call process(unprocessedTextRenderInfos) and unprocessedTextRenderInfos.clear() once more; alternatively, you may do that in the endTextBlock method.
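To make the flow concrete, here is a minimal, library-free sketch of the bookkeeping described above. Everything in it is an assumption for illustration: Glyph is a hypothetical stand-in for one per-character TextRenderInfo (reduced to its text, the start and end x of its baseline, and the baseline y), Word is one possible "word with position" structure, and WordCollector with its method names is made up. The line/gap decision is passed in as a boolean rather than computed, and any character that is not a letter or digit is treated as a word end, which is cruder than you may want (it splits at apostrophes and hyphens).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a single-character TextRenderInfo. */
final class Glyph {
    final String text;
    final float startX, endX, y;
    Glyph(String text, float startX, float endX, float y) {
        this.text = text; this.startX = startX; this.endX = endX; this.y = y;
    }
}

/** One possible "word with position" structure. */
final class Word {
    final String text;
    final float startX, endX, y;
    Word(String text, float startX, float endX, float y) {
        this.text = text; this.startX = startX; this.endX = endX; this.y = y;
    }
}

final class WordCollector {
    private final List<Word> words = new ArrayList<>();
    private final List<Glyph> unprocessed = new ArrayList<>();

    /**
     * Counterpart of renderText: 'glyphs' is one text chunk already split per
     * character; 'breakBefore' is the result of the new-line / gap check
     * against the previous chunk.
     */
    void renderText(List<Glyph> glyphs, boolean breakBefore) {
        if (breakBefore) {
            process();
        }
        for (Glyph glyph : glyphs) {
            if (glyph.text.isEmpty()) {
                continue; // nothing to add, nothing to split on
            }
            if (isWordEnd(glyph)) {
                process(); // finish the current word, drop the separator itself
            } else {
                unprocessed.add(glyph);
            }
        }
    }

    /** Call once page processing is finished to emit a trailing word. */
    void finish() {
        process();
    }

    List<Word> getWords() {
        return words;
    }

    /** Simplistic assumption: anything that is not a letter or digit ends a word. */
    private static boolean isWordEnd(Glyph glyph) {
        return !Character.isLetterOrDigit(glyph.text.charAt(0));
    }

    /** Builds a Word from the pending glyphs (if any) and clears them. */
    private void process() {
        if (unprocessed.isEmpty()) {
            return;
        }
        StringBuilder text = new StringBuilder();
        for (Glyph glyph : unprocessed) {
            text.append(glyph.text);
        }
        Glyph first = unprocessed.get(0);
        Glyph last = unprocessed.get(unprocessed.size() - 1);
        words.add(new Word(text.toString(), first.startX, last.endX, first.y));
        unprocessed.clear();
    }
}
```

In an actual RenderListener you would build the per-character list from getCharacterRenderInfos(), read the coordinates from each character's baseline, and call the final flush either after the page has been processed or in endTextBlock.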
Having done this, you might feel ready to implement the slightly more complex variant which does not have the same assumptions concerning the page content structure. ;)