How to upgrade Spark to newer version?
I have a virtual machine which has Spark 1.3
on it but I want to upgrade it to Spark 1.5
primarily due certain supported functionalities which were not in 1.3. Is it possible I can upgrade the Spark
version from 1.3
to 1.5
and if yes then how can I do that?
Pre-built Spark distributions, like the one I believe you are using based on another question of yours, are rather straightforward to "upgrade", since Spark is not actually "installed". Actually, all you have to do is:
- Download the appropriate Spark distro (pre-built for Hadoop 2.6 and later, in your case)
- Unzip the tar file in the appropriate directory (i.e.where folder
spark-1.3.1-bin-hadoop2.6
already is) - Update your
SPARK_HOME
(and possibly some other environment variables depending on your setup) accordingly
Here is what I just did myself, to go from 1.3.1 to 1.5.2, in a setting similar to yours (vagrant VM running Ubuntu):
1) Download the tar file in the appropriate directory
vagrant@sparkvm2:~$ cd $SPARK_HOME
vagrant@sparkvm2:/usr/local/bin/spark-1.3.1-bin-hadoop2.6$ cd ..
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster ipcontroller2 iptest ipython2 spark-1.3.1-bin-hadoop2.6
ipcluster2 ipengine iptest2 jsonschema
ipcontroller ipengine2 ipython pygmentize
vagrant@sparkvm2:/usr/local/bin$ sudo wget http://apache.tsl.gr/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
[...]
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster ipcontroller2 iptest ipython2 spark-1.3.1-bin-hadoop2.6
ipcluster2 ipengine iptest2 jsonschema spark-1.5.2-bin-hadoop2.6.tgz
ipcontroller ipengine2 ipython pygmentize
Notice that the exact mirror you should use with wget
will be probably different than mine, depending on your location; you will get this by clicking the "Download Spark" link in the download page, after you have selected the package type to download.
2) Unpack the tgz
file with
vagrant@sparkvm2:/usr/local/bin$ sudo tar -xzf spark-1.*.tgz
vagrant@sparkvm2:/usr/local/bin$ ls
ipcluster ipcontroller2 iptest ipython2 spark-1.3.1-bin-hadoop2.6
ipcluster2 ipengine iptest2 jsonschema spark-1.5.2-bin-hadoop2.6
ipcontroller ipengine2 ipython pygmentize spark-1.5.2-bin-hadoop2.6.tgz
You can see that now you have a new folder, spark-1.5.2-bin-hadoop2.6
.
3) Update accordingly SPARK_HOME
(and possibly other environment variables you are using) to point to this new directory instead of the previous one.
And you should be done, after restarting your machine.
Notice that:
- You don't need to remove the previous Spark distribution, as long as all the relevant environment variables point to the new one. That way, you may even quickly move "back-and-forth" between the old and new version, in case you want to test things (i.e. you just have to change the relevant environment variables).
-
sudo
was necessary in my case; it may be unnecessary for you depending on your settings. - After ensuring that everything works fine, it's good idea to delete the downloaded
tgz
file. - You can use the exact same procedure to upgrade to future versions of Spark, as they come out (rather fast). If you do this, either make sure that previous
tgz
files have been deleted, or modify thetar
command above to point to a specific file (i.e. no*
wildcards as above).
- Set your
SPARK_HOME
to/opt/spark
-
Download the latest pre-built binary i.e.
spark-2.2.1-bin-hadoop2.7.tgz
- can usewget
- Create the symlink to the latest download -
ln -s /opt/spark-2.2.1 /opt/spark
- Edit files in
$SPARK_HOME/conf
accordingly
For every new version you download just create the symlink to it (step 3)
ln -s /opt/spark-x.x.x /opt/spark