Valid identifier characters in Scala

Working from the EBNF syntax in the spec:

upper ::= ‘A’ | ... | ‘Z’ | ‘$’ | ‘_’ and Unicode category Lu
lower ::= ‘a’ | ... | ‘z’ and Unicode category Ll
letter ::= upper | lower and Unicode categories Lo, Lt, Nl
digit ::= ‘0’ | ... | ‘9’
opchar ::= “all other characters in \u0020-007F and Unicode
            categories Sm, So except parentheses ([]) and periods”

But also taking into account the very beginning of the Lexical Syntax chapter, which defines:

Parentheses ‘(’ | ‘)’ | ‘[’ | ‘]’ | ‘{’ | ‘}’.
Delimiter characters ‘`’ | ‘'’ | ‘"’ | ‘.’ | ‘;’ | ‘,’

Here is what I come up with. Working by elimination in the range \u0020-007F, eliminating letters, digits, parentheses and delimiters, we have for opchar... (drumroll):

! # % & * + - / : < = > ? @ \ ^ | ~ and also Sm and So - except for parentheses and periods.
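The elimination above can be double-checked mechanically. This is only a rough sketch: the object and helper names are made up for illustration, and it assumes that "printable" means 0x21 to 0x7E (so space and DEL are skipped) and that $ and _ count as letters, as in the grammar quoted above.

```scala
object AsciiOpChars {
  // Parentheses and delimiters as listed in the Lexical Syntax chapter.
  private val parentheses = "()[]{}".toSet
  private val delimiters  = "`'\".;,".toSet

  // '$' and '_' are treated as (upper-case) letters by the grammar.
  private def isAsciiLetter(c: Char): Boolean =
    (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '$' || c == '_'

  private def isAsciiDigit(c: Char): Boolean = c >= '0' && c <= '9'

  // Assumption: printable, non-whitespace ASCII is 0x21..0x7E.
  val opChars: String =
    (0x21 to 0x7E).map(_.toChar)
      .filterNot(c => isAsciiLetter(c) || isAsciiDigit(c) || parentheses(c) || delimiters(c))
      .mkString

  def main(args: Array[String]): Unit = println(opChars)
}
```

Whatever is left in opChars after the filter is the ASCII part of opchar; the Sm and So categories still have to be added on top of that.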

(Edit: adding valid examples here.) In summary, here are some valid examples that highlight all cases. Watch out for \ in the REPL: I had to escape it as \\:

val !#%&*+-/:<=>?@\^|~ = 1 // all simple opchars
val simpleName = 1 
val withDigitsAndUnderscores_ab_12_ab12 = 1 
val wordEndingInOpChars_!#%&*+-/:<=>?@\^|~ = 1
val !^©® = 1 // opchars and symbols
val abcαβγ_!^©® = 1 // mixing unicode letters and symbols

Note 1:

I found this Unicode category index to figure out Lu, Ll, Lo, Lt, Nl:

  • Lu (uppercase letters)
  • Ll (lowercase letters)
  • Lo (other letters)
  • Lt (titlecase)
  • Nl (letter numbers like roman numerals)
  • Sm (symbol math)
  • So (symbol other)
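If you would rather not look characters up in an index, the JVM can report the category directly. A small sketch, assuming you are on the JVM: Character.getType and the constants are standard java.lang.Character members, while the object name, the label table and categoryOf are just illustrative names of mine.

```scala
object UnicodeCategory {
  // Map the java.lang.Character category constants to the two-letter names used above.
  private val labels: Map[Int, String] = Map(
    Character.UPPERCASE_LETTER.toInt -> "Lu",
    Character.LOWERCASE_LETTER.toInt -> "Ll",
    Character.OTHER_LETTER.toInt     -> "Lo",
    Character.TITLECASE_LETTER.toInt -> "Lt",
    Character.LETTER_NUMBER.toInt    -> "Nl",
    Character.MATH_SYMBOL.toInt      -> "Sm",
    Character.OTHER_SYMBOL.toInt     -> "So",
    Character.CURRENCY_SYMBOL.toInt  -> "Sc"
  )

  // Returns the two-letter category, or "?" for categories not listed in the table.
  def categoryOf(c: Char): String = labels.getOrElse(Character.getType(c), "?")

  def main(args: Array[String]): Unit =
    // + , not sign, copyright sign, pound sign, roman numeral four
    Seq('+', '\u00AC', '\u00A9', '\u00A3', '\u2163').foreach { c =>
      println(s"$c -> ${categoryOf(c)}")
    }
}
```

This only handles characters that fit in a single Char; supplementary code points would need the Int overload of Character.getType.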

Note 2:

val #^ = 1 // legal   - two opchars
val #  = 1 // illegal - reserved word like class or => or @
val +  = 1 // legal   - opchar
val &+ = 1 // legal   - two opchars
val &2 = 1 // illegal - opchar and letter do not mix arbitrarily
val £2 = 1 // working - £ is part of Sc (Symbol currency) - undefined by spec
val ¬  = 1 // legal   - part of Sm

Note 3:

Other operator-looking things that are reserved words: _ : = => <- <: <% >: # @ and also \u21D2 ⇒ and \u2190 ←


The language specification gives the rule in Chapter 1, Lexical Syntax (on page 3):

  1. Operator characters. These consist of all printable ASCII characters \u0020-\u007F which are in none of the sets above, mathematical symbols (Sm) and other symbols (So).

This is basically the same as your extract from Programming in Scala. + is definitely a printable ASCII character not listed above (it is not a letter, which would include _ and $, nor a digit, a parenthesis, or a delimiter), so it qualifies as an operator character whatever its Unicode category (which happens to be Sm).

In your list:

  1. # is illegal not because the character is not an operator character (#^ is legal), but because it is a reserved word (on page 4), for type projection.
  2. &2 is illegal because you mix an operator character, &, with a non-operator character, the digit 2.
  3. £2 is legal because £ is not an operator character: it is not seven-bit ASCII but 8-bit extended ASCII. This is not nice, since $ is not an operator character either (it is considered a letter).
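To make the three cases concrete, here is a minimal sketch of the rule as I read it, not the compiler's actual lexer. The names OpCharCheck, isOpChar and isLegalOperatorIdentifier are hypothetical, the ASCII set is the one obtained by elimination in the question, and the reserved list is the one from Note 3.

```scala
object OpCharCheck {
  // ASCII operator characters found by elimination in the question.
  private val asciiOpChars = "!#%&*+-/:<=>?@\\^|~".toSet

  // Reserved operator-looking words from Note 3 (the last two are the arrows U+21D2 and U+2190).
  private val reserved = Set("_", ":", "=", "=>", "<-", "<:", "<%", ">:", "#", "@", "\u21D2", "\u2190")

  // An opchar is one of the ASCII characters above, or a non-ASCII character in Sm or So.
  def isOpChar(c: Char): Boolean = {
    val t = Character.getType(c)
    asciiOpChars(c) ||
      (c > 0x7F && (t == Character.MATH_SYMBOL.toInt || t == Character.OTHER_SYMBOL.toInt))
  }

  // Assumption: an operator identifier is one or more opchars and must not be a reserved word.
  def isLegalOperatorIdentifier(s: String): Boolean =
    s.nonEmpty && s.forall(isOpChar) && !reserved(s)

  def main(args: Array[String]): Unit =
    Seq("#^", "#", "+", "&+", "&2", "\u00AC").foreach { s =>
      println(s"$s -> ${isLegalOperatorIdentifier(s)}")
    }
}
```

Note that this only covers identifiers made purely of operator characters; mixed names such as the wordEndingInOpChars_ example in the question follow a separate rule and are not modelled here.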