Valid identifier characters in Scala
Working from the EBNF syntax in the spec:
upper ::= ‘A’ | ... | ‘Z’ | ‘$’ | ‘_’ and Unicode category Lu
lower ::= ‘a’ | ... | ‘z’ and Unicode category Ll
letter ::= upper | lower and Unicode categories Lo, Lt, Nl
digit ::= ‘0’ | ... | ‘9’
opchar ::= “all other characters in \u0020-007F and Unicode
categories Sm, So except parentheses ([]) and periods”
But also taking into account the very beginning of the Lexical Syntax chapter, which defines:
Parentheses ‘(’ | ‘)’ | ‘[’ | ‘]’ | ‘{’ | ‘}’.
Delimiter characters ‘`’ | ‘'’ | ‘"’ | ‘.’ | ‘;’ | ‘,’
Here is what I came up with. Working by elimination in the range \u0020-007F, eliminating letters, digits, parentheses and delimiters, we have for opchar... (drumroll):
! # % & * + - / : < = > ? @ \ ^ | ~
and also Sm and So - except for parentheses and periods.
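The elimination above can be reproduced mechanically. This is only a sketch: the object name `AsciiOpChars` and its helpers are made up for illustration, and it assumes that "printable" means the non-space range \u0021-\u007E (space and DEL are left out).

```scala
object AsciiOpChars {
  // Character sets taken from the spec excerpts quoted above.
  private val parentheses = Set('(', ')', '[', ']', '{', '}')
  private val delimiters  = Set('`', '\'', '"', '.', ';', ',')

  // '$' and '_' are listed under `upper`, so they count as letters here.
  private def isAsciiLetter(c: Char): Boolean =
    c == '$' || c == '_' || ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')

  private def isAsciiDigit(c: Char): Boolean = '0' <= c && c <= '9'

  // Assumption: "printable" is read as 0x21..0x7E (no space, no DEL).
  val opChars: String =
    (0x21 to 0x7E)
      .map(_.toChar)
      .filterNot(c => isAsciiLetter(c) || isAsciiDigit(c) || parentheses(c) || delimiters(c))
      .mkString
}
```

`AsciiOpChars.opChars` gives the same 18 characters as the hand-made list, in code point order.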
In summary, here are some valid examples that highlight all the cases. Watch out for \ in the REPL: I had to escape it as \\.
val !#%&*+-/:<=>?@\^|~ = 1 // all simple opchars
val simpleName = 1
val withDigitsAndUnderscores_ab_12_ab12 = 1
val wordEndingInOpChars_!#%&*+-/:<=>?@\^|~ = 1
val !^©® = 1 // opchars and symbols
val abcαβγ_!^©® = 1 // mixing unicode letters and symbols
Note 1:
I found this Unicode category index to figure out Lu, Ll, Lo, Lt, Nl:
- Lu (uppercase letters)
- Ll (lowercase letters)
- Lo (other letters)
- Lt (titlecase)
- Nl (letter numbers like roman numerals)
- Sm (symbol math)
- So (symbol other)
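To check which category a given character falls into without consulting an index, the JVM's `Character.getType` can be asked directly. A small sketch follows; the object name `UnicodeCategories` and the `category` helper are hypothetical, and Sc is included only because it comes up in Note 2. It handles single `Char` values only, not supplementary code points.

```scala
object UnicodeCategories {
  // Character's category constants are bytes, while getType returns an Int,
  // so the keys are widened with .toInt.
  private val names: Map[Int, String] = Map(
    Character.UPPERCASE_LETTER.toInt -> "Lu",
    Character.LOWERCASE_LETTER.toInt -> "Ll",
    Character.OTHER_LETTER.toInt     -> "Lo",
    Character.TITLECASE_LETTER.toInt -> "Lt",
    Character.LETTER_NUMBER.toInt    -> "Nl",
    Character.MATH_SYMBOL.toInt      -> "Sm",
    Character.OTHER_SYMBOL.toInt     -> "So",
    Character.CURRENCY_SYMBOL.toInt  -> "Sc"
  )

  // Returns the two-letter category name, or "other" for anything not listed.
  def category(c: Char): String =
    names.getOrElse(Character.getType(c), "other")
}
```

For example, `UnicodeCategories.category('\u00A9')` (the copyright sign) is "So", `category('\u00AC')` (the not sign) is "Sm", and `category('\u00A3')` (the pound sign) is "Sc".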
Note 2:
val #^ = 1 // legal - two opchars
val # = 1 // illegal - reserved word like class or => or @
val + = 1 // legal - opchar
val &+ = 1 // legal - two opchars
val &2 = 1 // illegal - opchar and digit do not mix arbitrarily
val £2 = 1 // working - £ is part of Sc (Symbol currency) - undefined by spec
val ¬ = 1 // legal - part of Sm
Note 3:
Other operator-looking things that are reserved words: _ : = => <- <: <% >: # @
and also \u21D2 (⇒) and \u2190 (←)
The language specification gives the rule in Chapter 1, Lexical Syntax (on page 3):
- Operator characters. These consist of all printable ASCII characters \u0020-\u007F which are in none of the sets above, mathematical symbols (Sm) and other symbols (So).
This is basically the same as your extract from Programming in Scala. + is a printable ASCII character not listed above (not a letter, which includes _ and $, nor a digit, a parenthesis, or a delimiter), so it is an operator character; it also happens to be in Unicode category Sm.
In your list:
- # is illegal not because the character is not an operator character (#^ is legal), but because it is a reserved word (on page 4), for type projection.
- &2 is illegal because you mix an operator character, &, with a non-operator character, the digit 2.
- £2 is legal because £ is not an operator character: it is not seven-bit ASCII, but eight-bit extended ASCII. This is not nice, as $ is not an operator character either (it is considered a letter).
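The rules discussed in this thread can be put together into a rough checker. This is a sketch of the spec text as read here, not of what the compiler actually does: `IdentSketch`, `isPlainId` and the helpers are made-up names, the reserved set holds only the operator-like reserved words listed in the question (alphabetic keywords such as class are not checked), backquoted identifiers are ignored, and only single `Char` values are handled. Because it follows the spec literally, it rejects £2, even though the question reports that the compiler accepts it.

```scala
object IdentSketch {
  // ASCII opchars found by elimination; Sm/So apply to non-ASCII characters.
  private val asciiOpChars = "!#%&*+-/:<=>?@\\^|~".toSet
  private val symbolTypes =
    Set(Character.MATH_SYMBOL.toInt, Character.OTHER_SYMBOL.toInt)
  private val letterTypes = Set(
    Character.UPPERCASE_LETTER.toInt, Character.LOWERCASE_LETTER.toInt,
    Character.OTHER_LETTER.toInt, Character.TITLECASE_LETTER.toInt,
    Character.LETTER_NUMBER.toInt)

  // Operator-like reserved words from the question (assumed complete there).
  private val reserved = Set("_", ":", "=", "=>", "<-", "<:", "<%", ">:",
                             "#", "@", "\u21D2", "\u2190")

  def isOpChar(c: Char): Boolean =
    if (c < 128) asciiOpChars(c) else symbolTypes(Character.getType(c))

  def isLetter(c: Char): Boolean =
    c == '$' || c == '_' || letterTypes(Character.getType(c))

  def isDigit(c: Char): Boolean = '0' <= c && c <= '9'

  // plainid ::= op | letter {letter | digit} ['_' op]
  def isPlainId(s: String): Boolean =
    if (s.isEmpty || reserved(s)) false
    else if (isOpChar(s.head)) s.forall(isOpChar)
    else if (isLetter(s.head)) {
      // Split into the alphanumeric part and whatever follows it.
      val (word, rest) = s.span(c => isLetter(c) || isDigit(c))
      // An operator suffix is allowed only directly after an underscore.
      rest.isEmpty || (word.last == '_' && rest.forall(isOpChar))
    } else false
}
```

With this sketch, `isPlainId("#^")` and `isPlainId("&+")` are true, while `isPlainId("#")` (reserved) and `isPlainId("&2")` (operator character followed by a digit) are false.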