Get text content from a MediaWiki page via the API
Use action=parse to get the HTML:
/api.php?action=parse&page=test
One way to get the text out of that HTML is to load it into a browser and walk the DOM, keeping only the text nodes, e.g. with JavaScript.
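The same node walk can also be done without a browser. Here is a minimal sketch using Python's standard-library HTMLParser; the sample HTML is a made-up stand-in for what action=parse returns:

```python
from html.parser import HTMLParser

class TextNodes(HTMLParser):
    """Walk the parsed HTML and keep only the text nodes."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # handle_data fires only for text nodes, never for tags
        text = data.strip()
        if text:
            self.chunks.append(text)

def extract_text(html):
    parser = TextNodes()
    parser.feed(html)
    return " ".join(parser.chunks)

# Tiny stand-in for the HTML returned by action=parse
sample = '<div class="mw-parser-output"><p>Hello <b>world</b>!</p></div>'
print(extract_text(sample))  # -> Hello world !
```

This drops all markup but also all structure; for cleaner output, the TextExtracts approach below is usually preferable.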
The TextExtracts extension does roughly what you're asking. Use prop=extracts to get a cleaned-up response. For example, this link will give you cleaned-up text for the Stack Overflow article. What's also nice is that it still includes section tags, so you can identify the individual sections of the article.
Just to include a visible link in my answer, the above link looks like:
/api.php?format=xml&action=query&prop=extracts&titles=Stack%20Overflow&redirects=true
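The same query can be built programmatically, which avoids hand-escaping titles. A sketch using Python's standard library (the en.wikipedia.org endpoint is just an example; point it at your own wiki's api.php):

```python
from urllib.parse import urlencode

# Same parameters as the URL above; urlencode handles the escaping
params = {
    "format": "xml",
    "action": "query",
    "prop": "extracts",
    "titles": "Stack Overflow",
    "redirects": "true",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```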
Edit: As Amr mentioned, TextExtracts is an extension to MediaWiki, so it won't necessarily be available for every MediaWiki site.
Adding ?action=raw to the end of a MediaWiki page URL returns the latest content as raw wikitext. E.g.: https://en.wikipedia.org/wiki/Main_Page?action=raw
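One caveat: if the page URL already has a query string (e.g. an index.php?title=... URL), the parameter must be appended with & rather than ?. A small hypothetical helper sketching that:

```python
def raw_url(page_url):
    """Append action=raw, using '&' if the URL already has a query string."""
    sep = "&" if "?" in page_url else "?"
    return page_url + sep + "action=raw"

print(raw_url("https://en.wikipedia.org/wiki/Main_Page"))
# -> https://en.wikipedia.org/wiki/Main_Page?action=raw
```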
You can get the wiki content as plain text from the API by using the explaintext parameter. And if you need the content of several titles, you can fetch them all in a single call: use the pipe character | to separate the titles. For example, this API call returns the data for both the "Google" and "Yahoo" pages:
http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects=
Parameters:
- explaintext: Return extracts as plain text instead of limited HTML.
- exlimit=max: Return more than one result. The max is currently 20.
- exintro: Return only the content before the first section. If you want the full data, just remove this parameter.
- redirects=: Resolve redirect issues.
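With format=json, the extracts come back keyed by page ID under query.pages. A sketch of unpacking them; the sample response below is abridged and its extract text and page IDs are made up:

```python
import json

# Abridged, made-up stand-in for the real JSON response
sample = json.loads('''{
  "query": {"pages": {
    "101": {"pageid": 101, "title": "Google", "extract": "Google is a search engine."},
    "202": {"pageid": 202, "title": "Yahoo", "extract": "Yahoo is a web portal."}
  }}
}''')

def extracts_by_title(data):
    """Map each returned title to its plain-text extract."""
    return {p["title"]: p["extract"] for p in data["query"]["pages"].values()}

for title, text in sorted(extracts_by_title(sample).items()):
    print(title, "->", text)
```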
That's the simplest way:
http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Albert%20Einstein&prop=revisions&rvprop=content
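In the legacy response format of that call, the wikitext sits under the "*" key of the first revision. A sketch of unpacking it from JSON; the sample response below is abridged and its content and page ID are made up:

```python
def revision_wikitext(data):
    """Return the wikitext of the first revision of the first page
    (legacy response format, where content lives under the '*' key)."""
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]

# Abridged, made-up stand-in for the real response
sample = {"query": {"pages": {"736": {
    "title": "Albert Einstein",
    "revisions": [{"*": "'''Albert Einstein''' was a theoretical physicist."}],
}}}}
print(revision_wikitext(sample))
```

Note that this returns wikitext markup, not plain text, so it still needs parsing if you want clean prose.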