How to install a library on a Databricks cluster using a command in the notebook?
Actually, I want to install a library on my Azure Databricks cluster, but I cannot use the UI method because the cluster changes every time, and during that transition I cannot add a library to it through the UI. Is there any Databricks utility command for doing this?
Solution 1:
There are different methods to install packages in Azure Databricks:
GUI Method
Method1: Using libraries
To make third-party or locally-built code available to notebooks and jobs running on your clusters, you can install a library. Libraries can be written in Python, Java, Scala, and R. You can upload Java, Scala, and Python libraries and point to external packages in PyPI, Maven, and CRAN repositories.
Steps to install third party libraries:
Step1: Create Databricks Cluster.
Step2: Select the cluster created.
Step3: Select Libraries => Install New => Library Source = "Maven" => Coordinates => Search Packages => Select Maven Central => search for the required package (for example, GDAL) => select the required version (for example, 3.0.0) => Install.
Notebook methods
Method2: Using Cluster-scoped init scripts
Cluster-scoped init scripts are init scripts defined in a cluster configuration. Cluster-scoped init scripts apply to both clusters you create and those created to run jobs. Since the scripts are part of the cluster configuration, cluster access control lets you control who can change the scripts.
Step1: Create the init script on DBFS by running the cell below, then add the DBFS path dbfs:/databricks/scripts/gdal_install.sh to the cluster's init scripts.
# --- Run 1x to setup the init script. ---
# Restart cluster after running.
dbutils.fs.put("/databricks/scripts/gdal_install.sh","""
#!/bin/bash
sudo add-apt-repository ppa:ubuntugis/ppa
sudo apt-get update
sudo apt-get install -y cmake gdal-bin libgdal-dev python3-gdal""",
True)
Step2: Restart the cluster after running step1 for the first time.
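As a quick optional check, you can print the script's contents from a notebook using the standard dbutils file system utility to confirm it was written to DBFS:
dbutils.fs.head("/databricks/scripts/gdal_install.sh")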
Method3: Using pip install
Python packages can be installed in the Spark container using pip. For example, you can use pip to install the psutil library, as shown below.
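A minimal sketch, assuming a Databricks Runtime that supports the notebook-scoped %pip magic command (older runtimes can use %sh /databricks/python/bin/pip install psutil instead):
%pip install psutil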
Method4: Library utilities
The library utility is deprecated.
Library utilities allow you to install Python libraries and create an environment scoped to a notebook session. The libraries are available both on the driver and on the executors, so you can reference them in UDFs. This enables:
- Library dependencies of a notebook to be organized within the notebook itself.
- Notebook users with different library dependencies to share a cluster without interference.
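As a minimal sketch of this (deprecated) utility, run the following in a notebook cell; installPyPI installs a PyPI package scoped to the notebook session, and restartPython restarts the Python process so the new package can be imported:
dbutils.library.installPyPI("psutil")
dbutils.library.restartPython()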
CLI & API Methods
Method5: Libraries CLI
You run Databricks libraries CLI subcommands by appending them to databricks libraries.
databricks libraries -h
Install a JAR from DBFS:
databricks libraries install --cluster-id $CLUSTER_ID --jar dbfs:/test-dir/test.jar
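After sending the install request, you can check the status of libraries on the cluster; this assumes the legacy Databricks CLI is installed and configured with your workspace credentials, and that $CLUSTER_ID is set:
databricks libraries cluster-status --cluster-id $CLUSTER_ID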
Method6: Libraries API
The Libraries API allows you to install and uninstall libraries and get the status of libraries on a cluster.
Install libraries on a cluster by sending a POST request to the 2.0/libraries/install endpoint. The installation is asynchronous: it completes in the background after the request returns. An example request body and a sample Python call are shown below.
Example Request:
{
  "cluster_id": "10201-my-cluster",
  "libraries": [
    {
      "jar": "dbfs:/mnt/libraries/library.jar"
    },
    {
      "egg": "dbfs:/mnt/libraries/library.egg"
    },
    {
      "whl": "dbfs:/mnt/libraries/mlflow-0.0.1.dev0-py2-none-any.whl"
    },
    {
      "whl": "dbfs:/mnt/libraries/wheel-libraries.wheelhouse.zip"
    },
    {
      "maven": {
        "coordinates": "org.jsoup:jsoup:1.7.2",
        "exclusions": ["slf4j:slf4j"]
      }
    },
    {
      "pypi": {
        "package": "simplejson",
        "repo": "https://my-pypi-mirror.com"
      }
    },
    {
      "cran": {
        "package": "ada",
        "repo": "https://cran.us.r-project.org"
      }
    }
  ]
}
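If you prefer to call the endpoint from Python rather than curl, a minimal sketch using the requests library might look like the following; the workspace URL, token, and payload are placeholders to adapt to your environment:
import requests

# Placeholders - substitute your workspace URL and a personal access token.
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "cluster_id": "10201-my-cluster",
    "libraries": [{"pypi": {"package": "simplejson"}}],
}

# The install endpoint returns immediately; installation finishes in the background on the cluster.
response = requests.post(
    f"{workspace_url}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
response.raise_for_status()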