How to get all links (anchor texts) pointing to a particular Wikidata entity
I am working on an ML problem for which I need a list of all the anchor texts whose links point to a particular Wikidata entity.
For example, take the entity "Federal Reserve" (Q2044983). Links to this entity may appear on many Wikipedia pages (the articles of other entities), and these links may have different anchor texts, such as:
- 'U.S. Federal Reserve Board'
- 'Fed'
- 'U.S. Federal Reserve System'
- 'Federal Reserve Bank' etc.
I need to extract the above anchor texts.
Current progress: I have been trying to get these from Wikidata, but without success. Any help is much appreciated.
Wikidata does not help here, because it does not store anchor texts. Anchor texts are part of the page content (wikitext) and can only be obtained by fetching the wikitext of the relevant pages.
The first step is to get all the pages in the article namespace that link to the page you are interested in. (url)
These links can be of three types: 'transclusions', 'links' and 'redirects'.
Transclusions are relevant mostly for templates, not articles. Redirects may meet your requirement (url). If you need the anchor texts, you have to get the wikitext of each linking page and search it for the pattern "[[<page name or one of its redirect names>|<anchor text>]]". A link written without the pipe, "[[<page name>]]", uses the page name itself as its anchor text.
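As a rough illustration of the pattern search described above, here is a minimal Python sketch. The function names, the regular expression and the sample titles are my own assumptions, not part of any library, and a regex only approximates real wikitext parsing (nested links in image captions, templates and similar constructs are not handled).

```python
import re

# Matches [[target]], [[target|anchor]] and [[target#section|anchor]].
# Group 1 is the link target, group 2 the optional anchor text.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def normalize_title(title):
    """Approximate MediaWiki title comparison: underscores are spaces,
    surrounding whitespace is ignored, the first letter is case-insensitive."""
    title = " ".join(title.replace("_", " ").split())
    return title[:1].upper() + title[1:]


def anchor_texts(wikitext, target_titles):
    """Return the anchor text of every link in `wikitext` whose target is
    one of `target_titles` (the page name plus its redirect names)."""
    targets = {normalize_title(t) for t in target_titles}
    found = []
    for match in WIKILINK.finditer(wikitext):
        target, label = match.group(1), match.group(2)
        if normalize_title(target) in targets:
            # A link without a pipe uses the target itself as anchor text.
            found.append(label.strip() if label else target.strip())
    return found


# Made-up sample wikitext, only to show the call.
sample = (
    "The [[Federal Reserve|Fed]] raised rates. See also the "
    "[[federal Reserve System|U.S. Federal Reserve System]], "
    "[[Federal Reserve]] and [[Alaska|the state]]."
)
print(anchor_texts(sample, ["Federal Reserve", "Federal Reserve System"]))
```

For anything beyond a quick experiment, a dedicated wikitext parser is probably a safer choice than a regex.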
The link information can be accessed through the MediaWiki API (url).
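A minimal sketch of such a query, assuming the English Wikipedia endpoint and the `list=backlinks` module of the MediaWiki Action API with `blredirect` turned on, so that pages linking through a redirect are reported as well. The helper names and the User-Agent string are hypothetical, and the endpoint should be swapped for whichever wiki you are working with.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia


def backlinks_params(title, limit=500):
    """Query parameters for list=backlinks, article namespace only."""
    return {
        "action": "query",
        "format": "json",
        "list": "backlinks",
        "bltitle": title,
        "blnamespace": "0",  # 0 = article namespace
        "blredirect": "1",   # also report pages that link via a redirect
        "bllimit": str(limit),
    }


def api_get(params):
    """Send one GET request to the API and decode the JSON response."""
    url = API + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "anchor-text-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def pairs_from_response(data, title):
    """Yield (linking page, title it links to) pairs from one response."""
    for page in data["query"]["backlinks"]:
        if "redirect" in page:
            # `page` is a redirect to `title`; its own backlinks are nested.
            for via in page.get("redirlinks", []):
                yield via["title"], page["title"]
        else:
            yield page["title"], title


def iter_backlinks(title):
    """Follow API continuation until all backlinks have been listed."""
    params = backlinks_params(title)
    while True:
        data = api_get(params)
        yield from pairs_from_response(data, title)
        if "continue" not in data:
            break
        params.update(data["continue"])
```

Calling `list(iter_backlinks("Federal Reserve"))` would perform the network requests. The second element of each pair tells you which name (the page itself or one of its redirects) to look for in the linking page's wikitext.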
The wikitext of a page, for example Alaska, can also be fetched through the MediaWiki API (url). If you do not find the pattern there, the link comes from a template that appears at the end of the article (e.g. "United States articles"), which you can ignore.
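One possible way to fetch the wikitext is the `action=parse` module with `prop=wikitext`. The sketch below again assumes the English Wikipedia endpoint; the helper names and the User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia


def wikitext_params(page):
    """Query parameters asking action=parse for the raw wikitext of a page."""
    return {
        "action": "parse",
        "format": "json",
        "formatversion": "2",
        "page": page,
        "prop": "wikitext",
    }


def wikitext_from_response(data):
    """Pull the wikitext string out of a decoded action=parse response."""
    text = data["parse"]["wikitext"]
    # With formatversion=2 this is a plain string; the older default
    # format wraps it in a dict under the "*" key.
    return text["*"] if isinstance(text, dict) else text


def fetch_wikitext(page):
    """Download the wikitext of `page` (performs a network request)."""
    url = API + "?" + urllib.parse.urlencode(wikitext_params(page))
    request = urllib.request.Request(
        url, headers={"User-Agent": "anchor-text-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return wikitext_from_response(json.load(response))
```

`fetch_wikitext("Alaska")` would then return the text to search for "[[...|...]]" links. If the target never shows up in it, the link most likely comes from a transcluded template, as noted above.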