I would like to iterate over some files in a folder whose path is in Databricks Repos. How would one do this? I don't seem to be able to access the files in Repos.

I have added a picture that shows which folders I would like to access (the dbrks & sql folders).

Thanks :)

Image of the repo folder hierarchy


Solution 1:

You can read files from Repos folders. The path is /mnt/repos/, which is the top-level folder you see when opening the Repos window. You can then iterate over the files under it yourself.

Once you find the file you want, you can read it with (for example) Spark. Here is an example for reading a CSV file:

# `path` is the path of the CSV file you found while iterating
spark.read.format("csv").load(
    path, header=True, inferSchema=True, delimiter=";"
)
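
The iteration step mentioned above could be sketched with the standard library as follows. This is a minimal sketch, assuming the repo folders are visible as ordinary directories on the driver's file system. `find_files` and its suffix filter are hypothetical names introduced for illustration, and the root path in the commented usage is a placeholder.

```python
import os
from typing import Iterable, List


def find_files(root: str, suffixes: Iterable[str] = (".csv",)) -> List[str]:
    """Return sorted paths of all files under `root` whose names end with one of `suffixes`."""
    wanted = tuple(suffixes)
    matches = []
    # os.walk visits `root` and every subdirectory below it
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(wanted):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Hypothetical usage; replace the placeholder with your own repo folder:
# for path in find_files("/mnt/repos/<your-repo-folder>"):
#     print(path)
```

Each returned path can then be passed to the Spark reader shown above.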

Solution 2:

If you just want to list files in the repositories, you can use the list command of the Workspace REST API. With it you can implement a recursive listing of files. The actual implementation will differ based on your requirements, for example whether you need a flat list of full paths or a list grouped by subdirectory. It could look something like this (not tested):

import requests

my_pat = "generated personal access token"
workspace_url = "https://name-of-workspace"


def list_files(base_path: str) -> list:
    """Recursively list all non-directory objects under base_path."""
    response = requests.get(
        f"{workspace_url}/api/2.0/workspace/list",
        headers={"Authorization": f"Bearer {my_pat}"},
        json={"path": base_path},
    )
    response.raise_for_status()
    results = []
    # An empty directory may come back without an "objects" key
    for obj in response.json().get("objects", []):
        if obj["object_type"] in ("DIRECTORY", "REPO"):
            # Descend into subdirectories and nested repos
            results.extend(list_files(obj["path"]))
        else:
            results.append(obj["path"])
    return results


all_files = list_files("/Repos/<my-initial-folder>")
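
If you need the listing grouped by subdirectory rather than as a flat list of full paths, the result can be post-processed. This is a minimal sketch, assuming a flat list of workspace paths like the one produced by the listing above. `group_by_subfolder` is a hypothetical helper, and the sample paths are made up for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def group_by_subfolder(paths: Iterable[str], base: str) -> Dict[str, List[str]]:
    """Group paths by their first folder below `base`; values are paths relative to `base`."""
    groups = defaultdict(list)
    for p in paths:
        # relative_to raises ValueError if p is not located under base
        rel = PurePosixPath(p).relative_to(base)
        # Files directly in `base` are collected under the key "."
        key = rel.parts[0] if len(rel.parts) > 1 else "."
        groups[key].append(str(rel))
    return dict(groups)


# Made-up sample paths, for illustration only
sample = [
    "/Repos/user/project/dbrks/load",
    "/Repos/user/project/sql/create_tables.sql",
    "/Repos/user/project/sql/views/daily.sql",
    "/Repos/user/project/README.md",
]
grouped = group_by_subfolder(sample, "/Repos/user/project")
```

Workspace paths always use forward slashes, which is why the sketch uses `PurePosixPath` instead of the platform-dependent `Path`.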

But if you want to read the contents of files in the repository, you need the so-called Arbitrary Files support, which is available since DBR 8.4.
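
With that support enabled, files in a repo should be readable with ordinary Python file APIs. This is a minimal sketch, assuming the notebook runs inside the repo so that relative paths resolve against its folder. `read_text_files`, the `sql` folder name in the commented usage, and the `*.sql` pattern are illustrative assumptions.

```python
from pathlib import Path
from typing import Dict


def read_text_files(folder: str, pattern: str = "*.sql") -> Dict[str, str]:
    """Read every file under `folder` matching `pattern` into a dict keyed by relative path."""
    base = Path(folder)
    contents = {}
    # rglob searches `folder` and all of its subfolders
    for path in sorted(base.rglob(pattern)):
        if path.is_file():
            contents[path.relative_to(base).as_posix()] = path.read_text(encoding="utf-8")
    return contents


# Hypothetical usage from a notebook located in the repo root:
# queries = read_text_files("sql")
```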