How do we access a file in a GitHub repo from an Azure Databricks notebook?

We need to access a file hosted in our private GitHub repo from an Azure Databricks notebook. Currently we do this with a curl command that uses a user's Personal Access Token (PAT).

curl -H 'Authorization: token INSERTACCESSTOKENHERE' \
     -H 'Accept: application/vnd.github.v3.raw' \
     -O -L https://api.github.com/repos/*owner*/*repo*/contents/*path*
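
For use inside a notebook cell, the same request can be sketched in Python with only the standard library. The helper names below are hypothetical; the URL layout and the two headers are copied from the curl command. The token is still a PAT, so this only mirrors the current approach and does not remove the need for one. Unlike curl -O, it returns the content as bytes instead of writing a file.

```python
import urllib.request


def build_raw_file_request(owner, repo, path, token):
    """Build the same request as the curl command (hypothetical helper)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    headers = {
        "Authorization": f"token {token}",
        # Ask GitHub for the raw file body rather than JSON metadata.
        "Accept": "application/vnd.github.v3.raw",
    }
    return urllib.request.Request(url, headers=headers)


def download_raw_file(owner, repo, path, token):
    """Send the request and return the file content as bytes."""
    request = build_raw_file_request(owner, repo, path, token)
    # urlopen follows redirects by default, similar to curl -L.
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A call would look like download_raw_file("owner", "repo", "path/to/file.json", token), with all four arguments supplied by you.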

Is there a way to avoid the PAT and use deploy keys or something similar instead?


Solution 1:

Since summer 2021, Databricks has offered Git repository integration (Repos). More info here: https://docs.microsoft.com/en-us/azure/databricks/repos

If you add your file (Excel, JSON, etc.) to the repo, you can then read it using a relative path.

e.g. pd.read_excel("./test_data.xlsx")
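
The same idea works for other file types. Here is a minimal sketch for a JSON file that uses only the standard library; the helper name and the file name in the usage line are hypothetical. It assumes the notebook's working directory is the repo folder that contains the file.

```python
import json
import os


def load_repo_json(relative_path):
    """Read a JSON file addressed relative to the current working directory."""
    # Resolve against the working directory explicitly so that a wrong
    # location shows up in the error message as a full path.
    full_path = os.path.join(os.getcwd(), relative_path)
    with open(full_path, encoding="utf-8") as handle:
        return json.load(handle)
```

Usage would be something like settings = load_repo_json("./test_config.json").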

Be aware that you need a cluster running Databricks Runtime 8.4+ (or possibly 9.1+; I am not sure which).

You can also check your current working directory by running os.getcwd() (after import os).

If you have correctly integrated the repo then your result should be something like:

/Workspace/Repos/<your-user-email>/REPO_FOLDER/analysis

otherwise it will be something like: /databricks/driver
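
That check can be wrapped in a small helper so a notebook fails early when it is not running from the repo. This is only a sketch: the function name is hypothetical, and the /Workspace/Repos/ prefix is an assumption taken from the example path above, so it may differ in your workspace.

```python
import os

# Assumption: prefix taken from the example path in this answer.
REPOS_PREFIX = "/Workspace/Repos/"


def running_from_repo(cwd=None):
    """Return True if the working directory looks like a Repos checkout."""
    if cwd is None:
        cwd = os.getcwd()
    return cwd.startswith(REPOS_PREFIX)
```

In a notebook you could call running_from_repo() before reading any relative paths and raise an error if it returns False.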