How to upgrade Spark to a newer version?

I have a virtual machine with Spark 1.3 on it, but I want to upgrade it to Spark 1.5, primarily because of certain functionalities that were not in 1.3. Is it possible to upgrade from 1.3 to 1.5, and if so, how can I do that?


Pre-built Spark distributions, like the one I believe you are using based on another question of yours, are rather straightforward to "upgrade", since Spark is not actually "installed". All you have to do is:

  • Download the appropriate Spark distro (pre-built for Hadoop 2.6 and later, in your case)
  • Unpack the tar archive in the appropriate directory (i.e. where the folder spark-1.3.1-bin-hadoop2.6 already is)
  • Update your SPARK_HOME (and possibly some other environment variables depending on your setup) accordingly

Here is what I just did myself, to go from 1.3.1 to 1.5.2, in a setting similar to yours (vagrant VM running Ubuntu):

1) Download the tar file into the appropriate directory

vagrant@sparkvm2:~$ cd $SPARK_HOME
vagrant@sparkvm2:/usr/local/bin/spark-1.3.1-bin-hadoop2.6$ cd ..
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster     ipcontroller2  iptest   ipython2    spark-1.3.1-bin-hadoop2.6
ipcluster2    ipengine       iptest2  jsonschema
ipcontroller  ipengine2      ipython  pygmentize
vagrant@sparkvm2:/usr/local/bin$ sudo wget http://apache.tsl.gr/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
[...]
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster     ipcontroller2  iptest   ipython2    spark-1.3.1-bin-hadoop2.6
ipcluster2    ipengine       iptest2  jsonschema  spark-1.5.2-bin-hadoop2.6.tgz
ipcontroller  ipengine2      ipython  pygmentize

Notice that the exact mirror you should use with wget will probably be different from mine, depending on your location; you can get it by clicking the "Download Spark" link on the download page, after you have selected the package type to download.

2) Unpack the tgz file with

vagrant@sparkvm2:/usr/local/bin$ sudo tar -xzf spark-1.*.tgz
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster     ipcontroller2  iptest   ipython2    spark-1.3.1-bin-hadoop2.6
ipcluster2    ipengine       iptest2  jsonschema  spark-1.5.2-bin-hadoop2.6
ipcontroller  ipengine2      ipython  pygmentize  spark-1.5.2-bin-hadoop2.6.tgz

You can see that now you have a new folder, spark-1.5.2-bin-hadoop2.6.

3) Update SPARK_HOME (and possibly other environment variables you are using) accordingly, so that it points to the new directory instead of the previous one.

After restarting your machine, you should be done.
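A minimal sketch of that last step, assuming the variables live in ~/.bashrc and using the paths from the session above (both are assumptions; adjust for your own setup). The demo operates on a scratch copy so it can be run safely:

```shell
# Demonstrated on a scratch copy so the real ~/.bashrc is untouched;
# the paths are assumptions taken from the session above.
cat > /tmp/bashrc.demo <<'EOF'
export SPARK_HOME=/usr/local/bin/spark-1.3.1-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH
EOF

# Swap the old directory name for the new one
sed -i 's|spark-1.3.1-bin-hadoop2.6|spark-1.5.2-bin-hadoop2.6|' /tmp/bashrc.demo
grep 'SPARK_HOME=' /tmp/bashrc.demo
# → export SPARK_HOME=/usr/local/bin/spark-1.5.2-bin-hadoop2.6
```

To apply it for real, run the sed command against ~/.bashrc and then open a new shell (or source the file).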

Notice that:

  1. You don't need to remove the previous Spark distribution, as long as all the relevant environment variables point to the new one. That way, you can even switch quickly back and forth between the old and new versions in case you want to test things (you just have to change the relevant environment variables).
  2. sudo was necessary in my case; it may be unnecessary for you depending on your settings.
  3. After ensuring that everything works fine, it's a good idea to delete the downloaded tgz file.
  4. You can use the exact same procedure to upgrade to future versions of Spark, as they come out (rather fast). If you do this, either make sure that previous tgz files have been deleted, or modify the tar command above to name a specific file (i.e. no * wildcard as above).
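The point in note 4 about avoiding the wildcard can be sketched like so. The directory and file names are illustrative, and the demo builds a dummy archive in a scratch directory so it runs anywhere:

```shell
# Build a dummy "distribution" tarball in a scratch directory so the
# demo runs anywhere; names mirror the real Spark file names.
mkdir -p /tmp/sparkdemo/spark-1.5.2-bin-hadoop2.6
echo demo > /tmp/sparkdemo/spark-1.5.2-bin-hadoop2.6/RELEASE
cd /tmp/sparkdemo
tar -czf spark-1.5.2-bin-hadoop2.6.tgz spark-1.5.2-bin-hadoop2.6
rm -r spark-1.5.2-bin-hadoop2.6

# Name the exact file instead of spark-1.*.tgz, so a stale tgz sitting
# in the same directory can never be picked up by accident
tar -xzf spark-1.5.2-bin-hadoop2.6.tgz
ls -d spark-1.5.2-bin-hadoop2.6
```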

  1. Set your SPARK_HOME to /opt/spark
  2. Download the latest pre-built binary, e.g. spark-2.2.1-bin-hadoop2.7.tgz (wget works fine), and unpack it under /opt
  3. Create a symlink to the latest download: ln -s /opt/spark-2.2.1 /opt/spark
  4. Edit the files in $SPARK_HOME/conf accordingly

For every new version you download, just re-create the symlink to it (step 3):

  • ln -s /opt/spark-x.x.x /opt/spark
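The symlink switch can be sketched as follows; the paths and version numbers here are illustrative, and the demo uses empty folders in a scratch directory rather than real Spark installs:

```shell
# Two fake Spark installs side by side (names are illustrative)
mkdir -p /tmp/opt/spark-2.2.1-bin-hadoop2.7 /tmp/opt/spark-2.3.0-bin-hadoop2.7

# Point the stable "spark" name at the current version;
# -n replaces an existing symlink instead of descending into it
ln -sfn /tmp/opt/spark-2.2.1-bin-hadoop2.7 /tmp/opt/spark
readlink /tmp/opt/spark   # → /tmp/opt/spark-2.2.1-bin-hadoop2.7

# Upgrading is just re-pointing the link;
# SPARK_HOME=/tmp/opt/spark never has to change
ln -sfn /tmp/opt/spark-2.3.0-bin-hadoop2.7 /tmp/opt/spark
readlink /tmp/opt/spark   # → /tmp/opt/spark-2.3.0-bin-hadoop2.7
```

The design advantage is that SPARK_HOME and everything built on top of it stay constant across upgrades; only the link target moves.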