Hyphenation (end-of-line division) of "Germany" and some other common words

I am currently trying to build a database of English words and their hyphenations (end-of-line divisions; en-US, if it matters), and in doing so I have come across some words for which I have found contradictory hyphenations. If those words were exotic, I would not be surprised, but some of them are frequently used. For example:

  • Germany: Merriam-Webster - Ger-ma-ny; Hunspell (by far the most widely used spell checker and hyphenator in the open-source world, driving applications such as LibreOffice, OpenOffice, Firefox and Thunderbird) - Ger-many

  • freely: Merriam-Webster - free-ly; Hunspell - freely

  • rapid: Merriam-Webster - rap-id; Hunspell - rapid

I have read a lot of articles (most of them on this site) about hyphenation. The general consensus seems to be that we should look up the respective word and its hyphenation in authoritative sources. But what if those sources contradict each other?

Another piece of advice that was often given is simply to hyphenate between syllables. Since I am not a native English speaker, this is extremely difficult for me. While I would have divided Germany and freely correctly, I would never have divided rapid correctly (in my world, it would have been ra-pid).

I have always considered the Oxford English Dictionary to be the most authoritative English dictionary. Imagine my surprise when I saw that it shows neither hyphenation nor syllabication. Wiktionary does show hyphenation, but only for some words; the examples mentioned above, although very common, are not among them, so it is of no use in this respect.

Could somebody please advise me what to do when two important sources, both of which can (somehow) be considered authoritative, show contradictory hyphenations? And, even more importantly, is there a reliable method for identifying the words that are suspect in this respect in the first place?

To explain the latter: I am currently using the Hunspell data to build my database semi-automatically; otherwise, I couldn't handle it. The Hunspell data is the only source I have found from which the hyphenation of a word can be obtained fairly easily.

As a second step, I would like to be able to identify and separate suspect words, which I then could look up manually in different sources (hoping that only about 5% of the words are suspect).
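A minimal sketch of how such a second step might look, assuming both sources have already been exported to simple word-to-hyphenation mappings (the dictionaries and function names below are hypothetical; the sample entries are the ones quoted in this question). It separates words where one source merely omits breaks that the other allows from words where the break points genuinely conflict. It also assumes the words themselves contain no real hyphens.

```python
def break_positions(hyphenated):
    """Return the set of character offsets at which a word may be divided.

    'Ger-ma-ny' -> {3, 5}; a word without hyphens yields an empty set.
    """
    positions = set()
    offset = 0
    for part in hyphenated.split("-")[:-1]:
        offset += len(part)
        positions.add(offset)
    return positions


def classify(hyph_a, hyph_b):
    """Compare two hyphenations of the same word.

    'agree'    - identical break points
    'subset'   - one source only omits breaks the other allows
    'conflict' - each source has a break the other lacks
    """
    a, b = break_positions(hyph_a), break_positions(hyph_b)
    if a == b:
        return "agree"
    if a <= b or b <= a:
        return "subset"
    return "conflict"


def find_suspects(source_a, source_b):
    """Return {word: verdict} for every shared word that does not agree."""
    report = {}
    for word in sorted(source_a.keys() & source_b.keys()):
        verdict = classify(source_a[word], source_b[word])
        if verdict != "agree":
            report[word] = verdict
    return report


# Sample data taken from the examples in this question (hypothetical layout).
merriam_webster = {"Germany": "Ger-ma-ny", "freely": "free-ly",
                   "rapid": "rap-id", "client": "cli-ent"}
hunspell = {"Germany": "Ger-many", "freely": "freely",
            "rapid": "rapid", "client": "client"}

if __name__ == "__main__":
    for word, verdict in find_suspects(merriam_webster, hunspell).items():
        print(word, verdict)
```

With this split, only the "conflict" words (for instance my own guess ra-pid against rap-id) would need a manual lookup, while the "subset" words could be resolved by a policy decision such as always taking the more conservative source.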

EDIT 1

As a reaction to one of the comments, I now have found a word where at least 3 characters are left at each side after hyphenation, but where different "authorities" hyphenate differently:

Microsoft Word 2010 hyphenates inconceivable as in-con-ceiv-a-ble, where Merriam-Webster has in-con-ceiv-able.

Another one: Merriam-Webster has cli-ent, whereas Hunspell has client, i.e. it does not hyphenate that word at all.

EDIT 2

@Hot Licks has pointed out that the dictionaries show syllable boundaries, not hyphenation points (if any). However, at least in the case of Merriam-Webster, the two are the same. From their dictionary API documentation:

<hw>...</hw>    (text = boldface)
    HEADWORD
    - This is the first bold word in an entry
    - contains "syllable" break points (that is, 
      end-of-line hyphenation points) here indicated 
      by asterisks, which will translate to raised dot, 
      {point} in Merriam-Webster font. 
    - may contain superscript homograph numbers 
      {h,1}, {h,2}, etc., in the same font (bold)
    - single word space after <hw> field

Please note the second item in that list. IMHO, it means that each syllable boundary is a hyphenation point, and vice versa.
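For illustration, here is a rough sketch of how a headword in that notation could be turned into break offsets. It assumes the headword text arrives as a plain string with asterisks at the break points and optional homograph markers such as {h,1}, as the excerpt describes; the function name and the sample strings are my own and not part of the Merriam-Webster API.

```python
import re

# Homograph markers as described in the excerpt, e.g. {h,1} or {h,2}.
HOMOGRAPH = re.compile(r"\{h,\d+\}")


def parse_headword(hw_text):
    """Split a headword such as 'Ger*ma*ny' into (plain word, break offsets)."""
    cleaned = HOMOGRAPH.sub("", hw_text).strip()
    parts = cleaned.split("*")
    word = "".join(parts)
    offsets = []
    pos = 0
    # Every part except the last one ends at a permitted division point.
    for part in parts[:-1]:
        pos += len(part)
        offsets.append(pos)
    return word, offsets


if __name__ == "__main__":
    print(parse_headword("Ger*ma*ny"))    # sample string, assumed format
    print(parse_headword("{h,1}rap*id"))  # marker position is an assumption
```

Storing the offsets rather than the marked-up string makes it easier to compare the result with data from a source that uses a different notation.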

EDIT 3

I have found more precise information. From Merriam-Webster's guide to pronunciation:

Hyphens are used to separate syllables in pronunciation transcriptions. [...]

The centered dots in boldface entry words indicate potential end-of-line division points and not syllabication. [...] As a result, the hyphens indicating syllable breaks and the centered dots indicating end-of-line division often do not fall in the same places.


Solution 1:

The first thing you have to understand is that hyphenation in English follows two different principles:

  • An "American" system, which derives from the principle that "Hyphens are used to separate syllables in pronunciation transcriptions." This rests on two basic fallacies: pronunciation transcriptions are a rare special case, whereas hyphenation is normally used in texts that are meant to be read, not recited; and even if you wanted to use pronunciation as a basis, syllabication differs considerably between regional dialects.

  • A "British" system, which breaks words according to their etymological components (prefixes and suffixes etc.). This makes the word breaks easier to follow, and should be preferred. Thus: con-ceiv-able. But this puts you in conflict with Microsoft and the like, of course.

Solution 2:

If you search for "hunspell hyphenation", you should find an end-of-line hyphenation library (ported from TeX) that should suit your needs. The minimum number of characters to leave on the left and on the right of a break are configurable parameters.

I don't know whether it can distinguish parts of speech, such as (verb) pro-ject vs. (noun) proj-ect.
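To show what the left and right minimums do, here is a small stand-alone sketch. It does not call the Hunspell library; the function and parameter names are made up for illustration. It takes a fully divided word and drops every break that would leave fewer than the required number of characters on either side.

```python
def apply_min_lengths(hyphenated, left_min=2, right_min=2):
    """Drop break points that leave too few characters before or after them."""
    parts = hyphenated.split("-")
    word = "".join(parts)

    # Collect the offsets of the breaks that satisfy both minimums.
    kept = []
    pos = 0
    for part in parts[:-1]:
        pos += len(part)
        if pos >= left_min and len(word) - pos >= right_min:
            kept.append(pos)

    # Rebuild the word with hyphens only at the surviving offsets.
    pieces = []
    start = 0
    for pos in kept:
        pieces.append(word[start:pos])
        start = pos
    pieces.append(word[start:])
    return "-".join(pieces)


if __name__ == "__main__":
    for entry in ("Ger-ma-ny", "free-ly", "rap-id", "cli-ent"):
        print(entry, "->", apply_min_lengths(entry, left_min=2, right_min=3))
```

With a right minimum of 3, Ger-ma-ny becomes Ger-many, and free-ly and rap-id lose their only break, which is consistent with several of the differences listed in the question. It does not account for cli-ent, so a minimum-length setting alone is unlikely to be the whole explanation.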

Solution 3:

This answer gives the general principles behind hyphenating words in English.

There is no single source for hyphenation in English. While all the sources follow the same principles, different sources make different judgment calls, so it's not surprising that they give different results.

No respectable source (this would include dictionaries and Hunspell) should give you an unacceptable hyphenation, so it's fine to pick one and use it. You should note, however, that some words, like project, have different hyphenations depending on whether they are a noun or a verb, and some, like debris, have different hyphenations in British and American English. This is because hyphenation sometimes depends on pronunciation, and pronunciation varies.