What is a "symbol" in Julia?

Specifically: I am trying to use Julia's DataFrames package, in particular the readtable() function with the names option, but that requires a vector of symbols.

  • what is a symbol?
  • why would they choose that over a vector of strings?

So far I have found only a handful of references to the word symbol in the Julia language. It seems that symbols are represented by ":var", but it is far from clear to me what they are.

Aside: I can run

df = readtable( "table.txt", names = [symbol("var1"), symbol("var2")] )

My two bulleted questions still stand.


Solution 1:

Symbols in Julia are the same as in Lisp, Scheme or Ruby. However, the answers to those related questions are not really satisfactory, in my opinion. If you read those answers, it seems that the reason a symbol is different than a string is that strings are mutable while symbols are immutable, and symbols are also "interned" – whatever that means. Strings do happen to be mutable in Ruby and Lisp, but they aren't in Julia, and that difference is actually a red herring. The fact that symbols are interned – i.e. hashed by the language implementation for fast equality comparisons – is also an irrelevant implementation detail. You could have an implementation that doesn't intern symbols and the language would be exactly the same.
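To make the two claims above concrete, here is a minimal sketch, assuming a current Julia 1.x session. It only checks that two symbols with the same name are the very same object, and that a Julia string cannot be modified in place. The variable names are illustrative.

```julia
# Two symbols with the same name are identical (===), however they are built.
a = :foo
b = Symbol("foo")
same_object = a === b

# Strings are immutable in Julia: there is no setindex! method for String,
# so trying to overwrite a character throws a MethodError.
string_is_immutable = try
    s = "foo"
    s[1] = 'g'
    false
catch err
    err isa MethodError
end
```

Neither property is what makes a symbol a symbol, as argued above; the sketch just shows what the terms mean.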

So what is a symbol, really? The answer lies in something that Julia and Lisp have in common – the ability to represent the language's code as a data structure in the language itself. Some people call this "homoiconicity" (Wikipedia), but others don't seem to think that alone is sufficient for a language to be homoiconic. But the terminology doesn't really matter. The point is that when a language can represent its own code, it needs a way to represent things like assignments, function calls, things that can be written as literal values, etc. It also needs a way to represent its own variables. I.e., you need a way to represent – as data – the foo on the left hand side of this:

foo == "foo"

Now we're getting to the heart of the matter: the difference between a symbol and a string is the difference between foo on the left hand side of that comparison and "foo" on the right hand side. On the left, foo is an identifier and it evaluates to the value bound to the variable foo in the current scope. On the right, "foo" is a string literal and it evaluates to the string value "foo". A symbol in both Lisp and Julia is how you represent a variable as data. A string just represents itself. You can see the difference by applying eval to them:

julia> eval(:foo)
ERROR: foo not defined

julia> foo = "hello"
"hello"

julia> eval(:foo)
"hello"

julia> eval("foo")
"foo"

What the symbol :foo evaluates to depends on what – if anything – the variable foo is bound to, whereas "foo" always just evaluates to "foo". If you want to construct expressions in Julia that use variables, then you're using symbols (whether you know it or not). For example:

julia> ex = :(foo = "bar")
:(foo = "bar")

julia> dump(ex)
Expr
  head: Symbol =
  args: Array{Any}((2,))
    1: Symbol foo
    2: String "bar"
  typ: Any

Among other things, the dump shows that there's a :foo symbol object inside the expression object you get by quoting the code foo = "bar". Here's another example, constructing an expression with the symbol :foo stored in the variable sym:

julia> sym = :foo
:foo

julia> eval(sym)
"hello"

julia> ex = :($sym = "bar"; 1 + 2)
:(begin
        foo = "bar"
        1 + 2
    end)

julia> eval(ex)
3

julia> foo
"bar"

If you try to do this when sym is bound to the string "foo", it won't work:

julia> sym = "foo"
"foo"

julia> ex = :($sym = "bar"; 1 + 2)
:(begin
        "foo" = "bar"
        1 + 2
    end)

julia> eval(ex)
ERROR: syntax: invalid assignment location ""foo""

It's easy to see why this fails: if you try to assign "foo" = "bar" by hand, that doesn't work either.
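If the name you have really is a string, one way around this is to convert it to a symbol before splicing it into the expression. The following is a small sketch along those lines, assuming Julia 1.x; the variable names name and ex2 are made up for illustration.

```julia
name = "foo"                       # the variable name, held as a string

# Convert to a Symbol first, then interpolate it into the quoted code.
ex2 = :($(Symbol(name)) = "bar")

# The result is the same expression you get by quoting foo = "bar" directly,
# or by building it by hand with the Expr constructor.
same_as_quoted = ex2 == :(foo = "bar")
same_as_built  = ex2 == Expr(:(=), :foo, "bar")

eval(ex2)                          # performs the assignment foo = "bar"
result = eval(:foo)                # look the variable up again via its symbol
```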

This is the essence of a symbol: a symbol is used to represent a variable in metaprogramming. Once you have symbols as a data type, of course, it becomes tempting to use them for other things, like as hash keys. But that's an incidental, opportunistic usage of a data type that has another primary purpose.
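As a quick illustration of that incidental usage, here is a sketch with symbols as dictionary keys, plus conversion in both directions. The dictionary contents are invented for the example.

```julia
# Symbols used as plain keys, with no metaprogramming involved.
counts = Dict(:apples => 3, :pears => 5)
counts[:apples] += 1

# Converting between the two representations is explicit.
as_string = String(:apples)        # Symbol -> String
as_symbol = Symbol("pears")        # String -> Symbol

# A symbol and a string with the same characters are still different values.
distinct = :pears != "pears"
```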

Note that I stopped talking about Ruby a while back. That's because Ruby isn't homoiconic: Ruby doesn't represent its expressions as Ruby objects. So Ruby's symbol type is kind of a vestigial organ – a leftover adaptation, inherited from Lisp, but no longer used for its original purpose. Ruby symbols have been co-opted for other purposes – as hash keys, to pull methods out of method tables – but symbols in Ruby are not used to represent variables.

As to why symbols are used in DataFrames rather than strings: a common pattern in DataFrames is to bind column values to variables inside of user-provided expressions. So it's natural for column names to be symbols, since symbols are exactly what you use to represent variables as data. At the time of writing, you have to write df[:foo] to access the foo column, but in the future you may be able to access it as df.foo instead. When that becomes possible, only columns whose names are valid identifiers will be accessible with this convenient syntax.
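To show what "binding column values to variables inside a user-provided expression" can look like, here is a toy sketch that does not use DataFrames at all. The helper with_columns is hypothetical (it is not a DataFrames function), and the "table" is just a NamedTuple of vectors; the point is only that the column names have to be symbols to become variables in the generated code.

```julia
# Hypothetical helper, NOT part of DataFrames: evaluate `ex` with every
# column of `cols` bound to a local variable named after the column.
function with_columns(cols::NamedTuple, ex)
    # One assignment expression per column, e.g. :(price = [10, 20]).
    bindings = [:($name = $value) for (name, value) in pairs(cols)]
    # Wrap them in a let block so the bindings are local to `ex`.
    return eval(Expr(:let, Expr(:block, bindings...), ex))
end

cols = (price = [10, 20], qty = [1, 3])
total = with_columns(cols, :(sum(price .* qty)))
```

Here price and qty inside the quoted expression are ordinary identifiers, and the helper can create them only because the column names are available as symbols.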

See also:

  • https://docs.julialang.org/en/latest/manual/metaprogramming/
  • In what sense are languages like Elixir and Julia homoiconic?

Solution 2:

Regarding the original question: as of the 0.21 release (and going forward), DataFrames.jl allows both Symbols and strings to be used as column names. Supporting both is not a problem, and in different situations the user might prefer one or the other.

Here is an example:

julia> using DataFrames

julia> df = DataFrame(:a => 1:2, :b => 3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> DataFrame("a" => 1:2, "b" => 3:4) # this is the same
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> df[:, :a]
2-element Array{Int64,1}:
 1
 2

julia> df[:, "a"] # this is the same
2-element Array{Int64,1}:
 1
 2
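
As a further sketch, assuming DataFrames.jl 0.21 or later is installed, the column names can also be retrieved in either form: names returns strings and propertynames returns Symbols.

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4)

string_names = names(df)           # column names as strings
symbol_names = propertynames(df)   # column names as Symbols

# The same column, reached through a Symbol, a string, or property access.
col_by_symbol   = df[:, :a]
col_by_string   = df[:, "a"]
col_by_property = df.a
```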