Azure Databricks: create audit trail for who ran what query at what moment
We have an audit requirement to provide insight into who executed what query at what moment in Azure Databricks. The Azure Databricks / Spark UI / Jobs tab already lists the Spark jobs executed including the query done and the time it was submitted. But it does not include who executed the query.
- Is there an API we can use with Azure Databricks to query these Spark job details shown in the UI? (The Databricks REST API does not seem to provide this, but maybe I'm overlooking something)
- Is there a way we can determine who created the Spark job (using an API)
Thanks, Gero
1. Access the Spark API
a. Driver node (internal) access the Azure Databricks Spark api:
import requests
driverIp = spark.conf.get('spark.driver.host')
port = spark.conf.get("spark.ui.port")
url = F"http://{driverIp}:{port}/api/v1/applications"
r = requests.get(url, timeout=3.0)
r.status_code, r.text
If for example you received this error message from public API:
PERMISSION_DENIED: Traffic on this port is not permitted
b. External Access to the Azure Databricks Spark API:
import requests
import json
"""
Program access to Databricks Spark UI.
Works external to Databricks environment or running within.
Requires a Personal Access Token. Treat this like a password, do not store in a notebook. Please refer to the Secrets API.
This Python code requires F string support.
"""
# https://<databricks-host>/driver-proxy-api/o/0/<cluster_id>/<port>/api/v1/applications/<application-id-from-master-spark-ui>/stages/<stage-id>
port = spark.conf.get("spark.ui.port")
clusterId = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
host = "eastus2.azuredatabricks.net"
workspaceId = "999999999999111" # follows the 'o=' in the databricks URLs or zero
token = "dapideedeadbeefdeadbeefdeadbeef68ee3" # Personal Access token
url = F"https://{host}/driver-proxy-api/o/{workspaceId}/{clusterId}/{port}/api/v1/applications/?status=running"
r = requests.get(url, auth=("token", token))
# print Application list response
print(r.status_code, r.text)
applicationId = r.json()[0]['id'] # assumes only one response
url = F"https://{host}/driver-proxy-api/o/{workspaceId}/{clusterId}/{port}/api/v1/applications/{applicationId}/jobs"
r = requests.get(url, auth=("token", token))
print(r.status_code, r.json())
2. Sorry, no, not at this time.
The cluster logs would be where you'd look, but the user identity is not there.
To vote and track this idea: https://ideas.databricks.com/ideas/DBE-I-313 How to get to the Ideas portal: https://docs.databricks.com/ideas.html