Get text content from a MediaWiki page via the API

Use action=parse to get the HTML:

/api.php?action=parse&page=test

One way to get the text out of that HTML would be to load it into a browser and walk the DOM with JavaScript, collecting only the text nodes.
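
The same node-walking idea works outside a browser too. Here is a minimal sketch using Python's standard-library html.parser, which fires a callback for each text node; the class and function names are just for illustration:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects text nodes, skipping the contents of script/style tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # handle_data fires only for text nodes, never for tags
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextOnly()
    parser.feed(html)
    return " ".join(parser.parts)
```

For example, `html_to_text("<p>Hello <b>world</b></p>")` joins the two text nodes into a single string.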


The TextExtracts extension does roughly what you're asking. Use prop=extracts to get a cleaned-up response. For example, this link will give you cleaned-up text for the Stack Overflow article. What's also nice is that it still includes section tags, so you can identify individual sections of the article.

Just to include a visible link in my answer, the above link looks like:

/api.php?format=xml&action=query&prop=extracts&titles=Stack%20Overflow&redirects=true
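
With format=xml, the extract text is nested inside an extract element. A sketch of pulling it out with Python's xml.etree; the sample document below is abbreviated and invented, not a real API response:

```python
import xml.etree.ElementTree as ET

# Illustrative, abbreviated shape of a prop=extracts XML response (assumption,
# not captured from the live API):
sample = """<api><query><pages>
<page pageid="1" title="Stack Overflow">
<extract xml:space="preserve">Stack Overflow is a question and answer site.</extract>
</page>
</pages></query></api>"""

root = ET.fromstring(sample)
# findtext with a descendant path locates the first <extract> element's text
extract = root.findtext(".//extract")
```

The same traversal works on the real response as long as the element is named `extract`.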

Edit: As Amr mentioned, TextExtracts is an extension to MediaWiki, so it won't necessarily be available for every MediaWiki site.


Adding ?action=raw to the end of a MediaWiki page URL returns the latest revision as raw wikitext. E.g. https://en.wikipedia.org/wiki/Main_Page?action=raw
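
One small wrinkle: if the page URL already carries a query string (e.g. an index.php?title=... URL), action=raw has to be appended with & rather than ?. A hypothetical helper that picks the right separator:

```python
from urllib.parse import urlparse

def raw_url(page_url):
    """Append action=raw, using '&' when the URL already has a query string.

    raw_url is a hypothetical helper name, not part of any MediaWiki client.
    """
    separator = "&" if urlparse(page_url).query else "?"
    return page_url + separator + "action=raw"
```

So `raw_url("https://en.wikipedia.org/wiki/Main_Page")` appends `?action=raw`, while an index.php-style URL gets `&action=raw`.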


You can get the wiki data in plain-text format from the API by using the explaintext parameter. Additionally, if you need information for several titles, you can fetch them all in a single call: separate the titles with the pipe character |. For example, this API call returns the data for both the "Google" and "Yahoo" pages:

http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects=

Parameters:

  • explaintext: Return extracts as plain text instead of limited HTML.
  • exlimit=max: Return more than one result. The max is currently 20.
  • exintro: Return only the content before the first section. If you want the full data, just remove this.
  • redirects=: Resolve redirects, so a redirected title returns the target page's extract.
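
A call like the one above can also be assembled programmatically. This Python sketch (the extracts_url helper name is hypothetical) joins the titles with | and lets urlencode handle the escaping:

```python
from urllib.parse import urlencode

def extracts_url(titles, intro_only=True):
    """Build a prop=extracts query URL for one or more titles.

    extracts_url is a hypothetical helper, not part of any MediaWiki client.
    Flag-style parameters (explaintext, exintro, redirects) take empty values.
    """
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exlimit": "max",
        "explaintext": "",
        "titles": "|".join(titles),   # pipe-separated multi-title lookup
        "redirects": "",
    }
    if intro_only:
        params["exintro"] = ""
    # urlencode percent-escapes the pipe as %7C, which the API decodes
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)
```

For example, `extracts_url(["Yahoo", "Google"])` produces a single URL covering both pages.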

The simplest way to get the raw wikitext is to query prop=revisions with rvprop=content: http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Albert%20Einstein&prop=revisions&rvprop=content
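
With format=json, the wikitext comes back under a "*" key inside each page's revisions list (in the legacy JSON format, i.e. without formatversion=2). A sketch of digging it out; the sample response below is abbreviated and illustrative, not a real API capture:

```python
# Abbreviated, illustrative shape of a prop=revisions&rvprop=content JSON
# response (assumption; real responses carry more fields):
sample = {
    "query": {
        "pages": {
            "736": {
                "title": "Albert Einstein",
                "revisions": [{"*": "'''Albert Einstein''' was a physicist."}],
            }
        }
    }
}

def wikitext_from_response(resp):
    """Collect the raw wikitext strings from a revisions query response.

    wikitext_from_response is a hypothetical helper name.
    """
    pages = resp["query"]["pages"]
    return [rev["*"]
            for page in pages.values()
            for rev in page.get("revisions", [])]
```

Pages are keyed by page ID, so the helper iterates over the values rather than assuming any particular key.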