How to extract text from MS office documents in C#
I was trying to extract a text(string) from MS Word (.doc, .docx), Excel and Powerpoint using C#. Where can i find a free and simple .Net library to read MS Office documents? I tried to use NPOI but i didn't get a sample about how to use NPOI.
For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files you can use the Open XML SDK. This snippet of code will open a document and return its contents as text. It is especially useful for anyone trying to use regular expressions to parse the contents of a Word document. To use this solution you would need reference DocumentFormat.OpenXml.dll, which is part of the OpenXML SDK.
See: http://msdn.microsoft.com/en-us/library/bb448854.aspx
public static string TextFromWord(SPFile file)
{
const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
StringBuilder textBuilder = new StringBuilder();
using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(file.OpenBinaryStream(), false))
{
// Manage namespaces to perform XPath queries.
NameTable nt = new NameTable();
XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
nsManager.AddNamespace("w", wordmlNamespace);
// Get the document part from the package.
// Load the XML in the document part into an XmlDocument instance.
XmlDocument xdoc = new XmlDocument(nt);
xdoc.Load(wdDoc.MainDocumentPart.GetStream());
XmlNodeList paragraphNodes = xdoc.SelectNodes("//w:p", nsManager);
foreach (XmlNode paragraphNode in paragraphNodes)
{
XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t", nsManager);
foreach (System.Xml.XmlNode textNode in textNodes)
{
textBuilder.Append(textNode.InnerText);
}
textBuilder.Append(Environment.NewLine);
}
}
return textBuilder.ToString();
}
Using PInvokes you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool. You can just ask the IFilter to return you the text from the file. There are several sets of example code (here is one such example).