How do we access a file in github repo inside our azure databricks notebook
We have a requirement where we need to access a file hosted on our github private repo in our Azure Databricks notebook. Currently we are doing it using curl command using the Personal Access Token of a user.
curl -H 'Authorization: token INSERTACCESSTOKENHERE' -H 'Accept:
application/vnd.github.v3.raw' -O -L
https://api.github.com/repos/*owner*/*repo*/contents/*path*
Is there a way we can avoid the use of PAT and use deploy keys or anything?
Solution 1:
From summer 2021 databricks has introduced integration of git repos functionality. More info here: https://docs.microsoft.com/en-us/azure/databricks/repos
If you add your file (excel, json etc.) in the repo, then you can use a relative path to access it and read it.
e.g. pd.read_excel("./test_data.xlsx")
Be aware that you need a cluster with a databricks version 8.4+ (or 9.1+?)
You can also test what is your current working directory by executing the following command. os.getcwd()
If you have correctly integrated the repo then your result should be something like:
/Workspace/Repos/[email protected]/REPO_FOLDER/analysis
otherwise it will be something like: /databricks/driver