Does Unicode have a unified way to input all diacritics, especially in MS Word 2010?

I searched and found that in Word 2010 the only way to input every kind of Unicode character is the "symbol table". Finding the character I need there wastes a lot of time, because the selection is organized by font sections rather than by the morphological or graphical features of the whole Unicode map.

I just wonder whether all Unicode characters (excluding kanji and similar things) can be subdivided into several components for the purposes of an input method (based only on the keyboard, without using any character table). For example, ᾧ can be subdivided into the four parts ῀+῾+ω+ι, and I would like to use shortcuts such as \slideOV + \roughOV + \omega + \iotaUD, or \~ + \' + \omega + \iotaUD, in which "OV" means over and "UD" means under.

MS Word 2010 has taken a step in this direction, but it is not very complete; what's more, AutoCorrect entries cannot be edited in bulk. I don't know about the LaTeX input method for Unicode, but I think the commands are far too long to remember and use.

PS: I really hope there is some input method that includes all Latin-based Unicode characters and their variants without requiring a character table or an unimaginable number of code numbers. The time needed to memorize \uNNNNs and the chance of forgetting them make that kind of input method too primitive, and the Unicode NAMEs are too long (they should be shortened to abbreviations), even though it does work...

For example, here is a wiki page about all the "a"-shaped characters in Unicode.

I'm posting this as an Answer, even though it is strictly a Comment about the MS Word 2010 part of your question. It is too long to fit in a comment. I've also just added some notes on another approach (at the end of the post).

I would experiment a little with using VBA to create/modify your shortcuts, using a subset of the possible characters to begin with (e.g. perhaps Greek letters and relevant diacritics).

What you would be aiming for (using your notation) would be to have the single autocorrect text \~\'\omega\iotaUD insert the single U+1FA7 character, and so on.

The basic VBA is straightforward -

' Value expects a String, so ChrW turns the code point into the character
AutoCorrect.Entries.Add Name:="\~\'\omega\iotaUD", Value:=ChrW(&H1FA7)

(you would need a little more to deal with the case where you wanted to replace the definitions). I suppose I would opt to put the character first and the diacritics afterwards, e.g. \omega\~\'\iotaUD, but it would be up to you to define a set of conventions that you could work with.

Using VBA looping and some information from the Unicode tables, it would be fairly easy to create autocorrects for every possible combination, e.g.

"greek letter (both cases)"
"greek letter (both cases)" + \~
"greek letter (both cases)" + \'
"greek letter (both cases)" + \iotaUD
"greek letter (both cases)" + \~ + \'
"greek letter (both cases)" + \~ + \iotaUD
"greek letter (both cases)" + \' + \iotaUD
"greek letter (both cases)" + \~ + \' + \iotaUD

Or you could perhaps narrow this to cover only those letters to which these accents are applicable.
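As a rough sketch of the "information from the Unicode tables" step, here is how the applicable combinations could be worked out. It is written in Python rather than VBA, only because Python's standard unicodedata module exposes the normalization tables directly. The shortcut names follow the question's notation; the function name precomposed_combinations and the restriction to lowercase letters are my own assumptions for illustration.

```python
import unicodedata
from itertools import combinations

# Combining marks for the shortcuts in the question, listed in the order
# breathing, accent, iota subscript (the order used by the decompositions).
MARKS = {
    "\\'": "\u0314",       # rough breathing (dasia)
    "\\~": "\u0342",       # perispomeni
    "\\iotaUD": "\u0345",  # iota subscript (ypogegrammeni)
}

def precomposed_combinations(bases):
    """Map (base, shortcuts) to a character when a single precomposed one exists."""
    found = {}
    names = list(MARKS)
    for base in bases:
        for r in range(len(names) + 1):
            for combo in combinations(names, r):
                decomposed = base + "".join(MARKS[n] for n in combo)
                composed = unicodedata.normalize("NFC", decomposed)
                if len(composed) == 1:  # NFC collapsed everything into one character
                    found[(base, combo)] = composed
    return found

# Lowercase alpha..omega, skipping final sigma (U+03C2).
greek_lower = [chr(cp) for cp in range(0x03B1, 0x03CA) if cp != 0x03C2]
table = precomposed_combinations(greek_lower)
```

Each entry of table could then be turned into one autocorrect definition. Combinations with no precomposed character (beta with a perispomeni, say) are simply left out, which is the "narrowed" variant described above.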

But this immediately raises a number of questions/points, including

Is there a limit to the number of autocorrects that Word lets you define?
Is there a practical limit to the number of autocorrects that Word lets you define? (e.g. perhaps everything slows down when you have 1000 or 10000)
The number of character + diacritic combinations is potentially enormous. Which do you really need?

and perhaps one way of pinning the problem down and reducing the number:

Do you only want autocorrects for those characters where a composite exists in the Unicode tables (smaller problem), or
do you want the autocorrects to insert the composite where one exists, and the relevant set of decomposed characters where one does not? (potentially vast problem)

Don't assume from the above that creating a suitable piece of VBA would be easy. Anyone writing such code would have to decide which combinations could be set up using patterns that exist in the Unicode tables, and which would have to be done using "brute force" enumeration. That is why I would start by trying to define a subset of the problem.

Another approach would be to define your "autocorrect" strings, but not actually as autocorrects. The idea would be to type the autocorrects, then press a key that would run a macro that would parse the text you entered and work out which character(s) you wanted to use. With a bit of care, you would be able to enter the strings corresponding to multiple characters so that you only had to press your special key once, rather than for each "complete" character. You would still need to consider some of the points/questions I listed above.
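A minimal sketch of the parsing step, again in Python for illustration only (a Word macro would need its own implementation). It assumes the "base letter first, diacritics afterwards" convention, a tiny hypothetical shortcut table, and a function name expand that I made up. Marks are sorted into the order breathing, accent, iota subscript before composing, so the order in which they were typed does not matter.

```python
import re
import unicodedata

# Hypothetical shortcut tables; only a few entries are shown.
BASES = {"omega": "\u03C9", "alpha": "\u03B1", "eta": "\u03B7"}
# (rank, mark): the rank fixes the order in which marks are applied.
MARKS = {"'": (0, "\u0314"), "~": (1, "\u0342"), "iotaUD": (2, "\u0345")}

# A token is a backslash followed by a run of letters or by one other character.
TOKEN = re.compile(r"\\([A-Za-z]+|.)")

def expand(text):
    """Replace shortcut runs such as a base followed by marks with real characters."""
    out = []
    pending = None  # [base, [(rank, mark), ...]] for the character being built
    pos = 0

    def flush():
        nonlocal pending
        if pending:
            base, marks = pending
            sequence = base + "".join(mark for _, mark in sorted(marks))
            out.append(unicodedata.normalize("NFC", sequence))
            pending = None

    for match in TOKEN.finditer(text):
        literal = text[pos:match.start()]
        if literal:
            flush()
            out.append(literal)
        name = match.group(1)
        if name in BASES:
            flush()
            pending = [BASES[name], []]
        elif name in MARKS and pending:
            pending[1].append(MARKS[name])
        else:
            flush()
            out.append(match.group(0))  # unknown shortcut: keep it as typed
        pos = match.end()
    flush()
    out.append(text[pos:])
    return "".join(out)
```

With this sketch, one call handles a whole run of typed shortcuts, which corresponds to pressing the special key once. Where no precomposed character exists, the NFC step leaves the remaining combining marks in place after the base letter.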

Unicode defines a character code: a set of characters, their coded representations (numbers and Unicode names), and other properties. It does not define input methods. Unicode as such does not define any way to input anything.

The “symbol table” in Word does not let you enter any character – only the characters that have glyphs in the currently selected font. There is a universal way in Word, though: the Alt X method: enter “u+” followed by the Unicode number of a character, then enter Alt X, and the string magically turns to the character. The part “u+” can be omitted if the previous character is not a digit, letter a–f, or x.

A subdivision, or decomposition, resembling the one you describe is possible in Unicode, but the Unicode standard describes it at the character code level only. It is called canonical decomposition, and it means that e.g. “ᾧ” U+1FA7 GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI can be decomposed to a simple omega followed by three combining mark characters: U+03C9 U+0314 U+0342 U+0345. Note that in Unicode, a combining mark appears after the base character. (This differs from common European input methods, where a dead key is often pressed before a base character.)
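The decomposition can be checked with any library that implements Unicode normalization. As an illustration (Python is my choice here, not something the question asked for), the standard unicodedata module produces the canonical decomposition via the NFD form:

```python
import unicodedata

composed = "\u1FA7"  # GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
decomposed = unicodedata.normalize("NFD", composed)

# One code point per character: the base letter first, then the combining marks.
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(code_points)

for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The list printed first contains the four code points given above, with the base omega first and the combining marks after it.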

This means that you could produce the character, in a sense, by entering the four characters in that sequence, possibly using macros or shortcuts or key assignments you have defined for them. But the result would still not be identical with “ᾧ”. It might look the same, and by Unicode principles it is expected to look the same, but it’s still distinct from the form encoded as one character, U+1FA7. And in practice, it may look different, possibly disturbingly different.

In my test on Word 2007, entering U+03C9 U+0314 U+0342 U+0345 yields the same visual appearance as U+1FA7. This is good news. Older versions of Word had serious difficulties in such issues. But it’s still four characters (four code points). A word processor could convert such a sequence to a corresponding canonically equivalent character, but it does not. This is relevant when you process the data programmatically or convert it to another format (e.g., to a publishing program format). And the appearance is OK only when the font used has those combining marks.

It would be possible, and not particularly difficult, to create a keyboard layout (keyboard driver) that makes e.g. a common US keyboard work for polytonic Greek so that letter keys produce Greek letters in a natural way (A produces α etc.), though you need some special convention for letters like ω, and some punctuation keys produce combining diacritic marks. The main problem is that you would then produce letters in decomposed format (like U+03C9 U+0314 U+0342 U+0345). But that format might be acceptable, or you could programmatically convert (normalize) it to a format that uses precomposed characters (like U+1FA7).
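To illustrate the difference between the two forms and the normalization step, here is a small sketch in Python, using the standard unicodedata module. The helper name canonically_equal is made up for this example.

```python
import unicodedata

precomposed = "\u1FA7"                 # one code point
sequence = "\u03C9\u0314\u0342\u0345"  # omega followed by three combining marks

def canonically_equal(a, b):
    """Compare two strings after converting both to the NFC normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(len(precomposed), len(sequence))           # 1 4
print(precomposed == sequence)                   # False: different code points
print(canonically_equal(precomposed, sequence))  # True: canonically equivalent

# Converting decomposed input (e.g. from a custom keyboard layout) to precomposed form.
recomposed = unicodedata.normalize("NFC", sequence)
```

A plain string comparison treats the two forms as different, which is why the distinction matters when the text is processed programmatically. After NFC normalization the four-character sequence becomes the single character U+1FA7.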
