How can I retrieve Wiktionary word content?

Solution 1:

The Wiktionary API can be used to query whether or not a word exists.

Examples for existing and non-existing pages:

http://en.wiktionary.org/w/api.php?action=query&titles=test http://en.wiktionary.org/w/api.php?action=query&titles=testx

The first link provides examples on other types of formats that might be easier to parse.

To retrieve the word's data in a small XHTML format (should more than existence be required), request the printable version of the page:

http://en.wiktionary.org/w/index.php?title=test&printable=yes http://en.wiktionary.org/w/index.php?title=testx&printable=yes

These can then be parsed with any standard XML parser.

Solution 2:

There are a few caveats in just checking that Wiktionary has a page with the name you are looking for:

Caveat #1: All Wiktionaries including the English Wiktionary actually have the goal of including every word in every language, so if you simply use above API call you will know that the word you are asking about is a word in at least one language, but not necessarily English: http://en.wiktionary.org/w/api.php?action=query&titles=dicare

Caveat #2: Perhaps a redirect exists from one word to another word. It might be from an alternative spelling, but it might be from an error of some kind. The API call above will not differentiate between a redirect and an article: http://en.wiktionary.org/w/api.php?action=query&titles=profilemetry

Caveat #3: Some Wiktionaries including the English Wiktionary include "common misspellings": http://en.wiktionary.org/w/api.php?action=query&titles=fourty

Caveat #4: Some Wiktionaries allow stub entries which have little or no information about the term. This used to be common on several Wiktionaries but not the English Wiktionary. But it seems to have now spread also to the English Wiktionary: https://en.wiktionary.org/wiki/%E6%99%B6%E7%90%83 (permalink for when the stub is filled so you can still see what a stub looks like: https://en.wiktionary.org/w/index.php?title=%E6%99%B6%E7%90%83&oldid=39757161)

If these are not included in what you want, you will have to load and parse the wikitext itself, which is not a trivial task.

Solution 3:

You can download a dump of Wiktionary data. There's more information in the FAQ. For your purposes, the definitions dump is probably a better choice than the XML dump.

Solution 4:

To keep it really simple, extract the words from the dump like this:

bzcat pages-articles.xml.bz2 | grep '<title>[^[:space:][:punct:]]*</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words

How can I retrieve Wiktionary word content?

Solution 1:

Solution 2:

Solution 3:

Solution 4:

Related

Recent Posts