How can I correctly prefix a word with "a" and "an"?
Solution 1:
- Download Wikipedia
- Unzip it and write a quick filter program that spits out only article text (the download is generally in XML format, along with non-article metadata too).
- Find all instances of "a ..." and "an ..." and build an index on the following word and all of its prefixes (a simple trie, i.e. a prefix tree, works for this). This should be case-sensitive, and you'll need a maximum word length, say 15 letters.
- (optional) Discard all prefixes that occur fewer than 5 times or where "a" vs. "an" achieves less than a 2/3 majority (or some other thresholds; tweak here). Preferably keep the empty prefix to avoid corner cases.
- You can optimize your prefix database by discarding all those prefixes whose parent shares the same "a" or "an" annotation.
- When determining whether to use "a" or "an", find the longest matching prefix and follow its lead. If you didn't discard the empty prefix in step 4, there will always be a matching prefix (namely the empty one); otherwise you may need a special case for a completely non-matching string (such input should be very rare).
You probably can't get much better than this - and it'll certainly beat most rule-based systems.
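The steps above can be sketched roughly as follows. This is a minimal illustration, not the AvsAn implementation: the function names `build_prefix_table` and `choose_article` are made up here, a plain dictionary stands in for the trie, the input is assumed to be (article, next word) pairs already extracted from the corpus, and the parent-sharing optimization is left out.

```python
from collections import defaultdict

MAX_PREFIX_LEN = 15  # maximum word length suggested in the steps above


def build_prefix_table(pairs, min_count=5, min_majority=2 / 3):
    """Map each prefix of the following word to the article it usually takes.

    `pairs` is an iterable of (article, next_word) tuples, e.g. ("an", "hour").
    A dict keyed by prefix is used instead of a real trie to keep this short.
    """
    counts = defaultdict(lambda: {"a": 0, "an": 0})
    for article, word in pairs:
        word = word[:MAX_PREFIX_LEN]
        for end in range(len(word) + 1):  # end == 0 is the empty prefix
            counts[word[:end]][article.lower()] += 1

    table = {}
    for prefix, c in counts.items():
        total = c["a"] + c["an"]
        winner = "an" if c["an"] > c["a"] else "a"
        # Always keep the empty prefix so that a lookup can never fail.
        if prefix == "" or (total >= min_count and c[winner] / total >= min_majority):
            table[prefix] = winner
    return table


def choose_article(table, word):
    """Follow the longest prefix of `word` that is present in the table."""
    word = word[:MAX_PREFIX_LEN]
    for end in range(len(word), -1, -1):
        if word[:end] in table:
            return table[word[:end]]
    return "a"  # only reached if the empty prefix was discarded
```

With a toy corpus such as `[("an", "hour"), ("a", "house"), ("a", "horse")]` and `min_count=1`, a lookup for "hourly" falls back to the prefix "hour" and a lookup for an unseen word falls back to the empty prefix. Real thresholds only make sense with counts from a large corpus.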
Edit: I've implemented this in JS/C#. You can try it in your browser, or download the small, reusable JavaScript implementation it uses. The .NET implementation is the package AvsAn on NuGet. The implementations are trivial, so it should be easy to port to any other language if necessary.
Turns out the "rules" are quite a bit more complex than I thought:
- it's an unanticipated result but it's a unanimous vote
- it's an honest decision but a honeysuckle shrub
- Symbols: It's an 0800 number, or an ∞ of oregano.
- Acronyms: It's a NASA scientist, but an NSA analyst; a FIAT car but an FAA policy.
...which just goes to underline that a rule-based system would be tricky to build!
Solution 2:
You need to use a list of exceptions. I don't think all of the exceptions are well defined, because it sometimes depends on the accent of the person saying the word.
One stupid way is to ask Google for the two possibilities (using one of the search APIs) and use the more popular one:
- http://www.google.co.uk/search?q=%22a+europe%22 - 841,000 hits
- http://www.google.co.uk/search?q=%22an+europe%22 - 25,000 hits
Or:
- http://www.google.co.uk/search?q=%22a+honest%22 - 797,000 hits
- http://www.google.co.uk/search?q=%22an+honest%22 - 8,220,000 hits
Therefore "a europe" and "an honest" are the correct versions.
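As a rough sketch of that comparison: the function below takes a hit-count lookup as a parameter, because which search API to call is left open here. `pick_by_popularity` and `hit_count` are hypothetical names, and the fake lookup in the usage example simply replays the figures quoted above.

```python
from typing import Callable


def pick_by_popularity(word: str, hit_count: Callable[[str], int]) -> str:
    """Return whichever of "a"/"an" gets more hits for the quoted phrase.

    `hit_count` is a caller-supplied function (hypothetical) that returns
    the number of results for an exact-phrase query such as '"an honest"'.
    """
    counts = {article: hit_count(f'"{article} {word}"') for article in ("a", "an")}
    # On a tie, max() returns the first key, so "a" wins.
    return max(counts, key=lambda article: counts[article])


# Usage with the hit counts quoted above standing in for a real search API.
QUOTED_HITS = {
    '"a europe"': 841_000,
    '"an europe"': 25_000,
    '"a honest"': 797_000,
    '"an honest"': 8_220_000,
}

print(pick_by_popularity("europe", lambda q: QUOTED_HITS.get(q, 0)))
print(pick_by_popularity("honest", lambda q: QUOTED_HITS.get(q, 0)))
```

Passing the lookup in as a function keeps the decision logic testable without network access, and makes it easy to cache results per word.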
Solution 3:
If you could find a source of word spellings to word pronunciations, like:
"honest":"on-ist"
"horrible":"hawr-uh-buhl, hor-"
You could base your decision on the first character of the spelled pronunciation string. For performance, perhaps you could use such a lookup to pre-generate exception sets and use those smaller lookup sets during execution instead.
Edited to add:
I think you could use this to generate your exceptions: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Not everything will be in the dictionary, of course, meaning not every possible exception would wind up in your exception sets. In that case, you could just default to "an" for vowels and "a" for consonants, or use some other heuristic with better odds.
(Looking through the CMU dictionary, I was pleased to see it includes proper nouns for countries and some other places, so it will handle examples like "a Ukrainian", "a USA Today paper", "a Urals-inspired painting".)
Editing once more to add: The CMU dictionary does not contain common acronyms, and you have to worry about those starting with s, f, l, m, n, u, and x, because the letter name is what gets pronounced. But there are plenty of acronym lists out there, like in Wikipedia, which you could use to add to the exceptions.
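A minimal sketch of the pronunciation lookup, assuming the plain-text CMUdict layout (a word followed by ARPAbet phonemes with stress digits on the vowels). The three sample entries are written in that style for illustration and should be checked against the real file; the function names are made up here.

```python
# ARPAbet vowel phonemes as used by the CMU dictionary (stress digits removed).
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}

# Illustrative entries in CMUdict style; load the real file in practice.
SAMPLE_ENTRIES = """\
HONEST  AA1 N AH0 S T
HORRIBLE  HH AO1 R AH0 B AH0 L
UKRAINIAN  Y UW0 K R EY1 N IY0 AH0 N
"""


def load_first_phonemes(text):
    """Return {WORD: first phoneme without stress digit} for each entry."""
    first = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):  # blank or comment line
            continue
        word, *phonemes = line.split()
        if "(" in word or not phonemes:  # skip alternates such as WORD(1)
            continue
        first[word.upper()] = phonemes[0].rstrip("012")
    return first


def article_from_pronunciation(word, first_phonemes):
    """Use the first phoneme if the word is known, else fall back to spelling."""
    phoneme = first_phonemes.get(word.upper())
    if phoneme is None:
        # Fallback heuristic from the text: "an" before a vowel letter.
        return "an" if word[:1].lower() in set("aeiou") else "a"
    return "an" if phoneme in ARPABET_VOWELS else "a"
```

To pre-generate the smaller exception sets mentioned above, one could run this over the whole dictionary and keep only the words where the pronunciation-based answer differs from the spelling-based fallback.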
Solution 4:
You have to implement it manually and add the exceptions you want, for example words where the first letter is 'H' followed by an 'O', like honest, hour ... and also the opposite ones like europe, university, used ...
Solution 5:
Since the choice between "a" and "an" is determined by phonetic rules and not by spelling conventions, I would probably do it like this:
- If the first letter of the word is a consonant -> 'a'
- If the first letter of the word is a vowel -> 'an'
- Keep a list of exceptions (hour, x-ray, university) as rjumnro says.
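The rule plus an exception list could look something like this sketch. The exception words are only the ones mentioned in this thread, so the lists are nowhere near complete, and the names `AN_EXCEPTIONS`, `A_EXCEPTIONS` and `article_by_rule` are made up for the example.

```python
VOWEL_LETTERS = set("aeiou")

# Words spelled with a leading consonant but pronounced with a vowel sound.
AN_EXCEPTIONS = {"honest", "hour", "x-ray"}
# Words spelled with a leading vowel but pronounced with a consonant sound.
A_EXCEPTIONS = {"europe", "university", "used"}


def article_by_rule(word):
    """Check the exception lists first, then fall back to the first letter."""
    key = word.lower()
    if key in AN_EXCEPTIONS:
        return "an"
    if key in A_EXCEPTIONS:
        return "a"
    return "an" if key[:1] in VOWEL_LETTERS else "a"


for w in ("apple", "dog", "hour", "university", "X-ray"):
    print(article_by_rule(w), w)
```

Exact-word matching means derived forms such as "hourly" or "honestly" need their own entries, which is one reason the exception list grows quickly.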