Pyspark - How to calculate file hashes

I have a bunch of CSV files in a mounted blob container and I need to calculate the SHA-1 hash value of every file to store as an inventory. I'm very new to Azure cloud and pyspark so I'm not sure how this can be achieved efficiently. I have written the following code in Python with pandas and I'm trying to use it in pyspark. It seems to work, but it takes quite a while to run since there are thousands of CSV files. I understand that things work differently in pyspark, so can someone please advise whether my approach is correct, or whether there is a better piece of code I can use to accomplish this task?

import os
import hashlib
import pandas as pd

class File:

    def __init__(self, path):
        self.path = path

        
    def get_hash(self):
        sha1 = hashlib.sha1()
        with open(self.path, "rb") as f:
            # Read the file in 4 KB chunks so large files don't need to fit in memory
            for chunk in iter(lambda: f.read(4096), b""):
                sha1.update(chunk)
        self.sha1_hash = sha1.hexdigest()
        return self.sha1_hash

path = '/dbfs/mnt/data/My_Folder' #Path to CSV files
cnt = 0
rlist = []


for root, subdirs, files in os.walk(path):
    for fi in files:
        if cnt < 10:  # check only 10 files for now as it takes ages!
            f = File(os.path.join(root, fi))
            cnt += 1
            hash_value = f.get_hash()
            results = {'File_Name': fi, 'File_Path': f.path, 'SHA1_Hash_Value': hash_value}
            rlist.append(results)

            print(fi)

print(str(cnt) + ' files processed')

df = pd.DataFrame(rlist)
#df.to_csv('/dbfs/mnt/workspace/Inventory/File_Hashes.csv', mode='a', header=False) #not sure how to write files in pyspark!
display(df)

Thanks


Solution 1:

Since you want to treat the files as blobs rather than read them into a table, I would recommend using spark.sparkContext.binaryFiles. This gives you an RDD of pairs where the key is the file path and the value is the file's contents as bytes, and you can compute the hash in a map function (e.g. rdd.mapValues(calculate_sha1)). This way the hashing is distributed across the cluster instead of running sequentially on the driver.

For more information, refer to the documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.binaryFiles.html#pyspark.SparkContext.binaryFiles
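
A minimal sketch of that approach on Databricks, assuming the mounted folder is readable by Spark at dbfs:/mnt/data/My_Folder (the helper name sha1_of_bytes and the output path are placeholders, not part of any API):

import hashlib
import os

# Read each file in the folder as a (file_path, file_bytes) pair
rdd = spark.sparkContext.binaryFiles("dbfs:/mnt/data/My_Folder")

def sha1_of_bytes(content):
    # content is the whole file as bytes, so a single update is enough
    return hashlib.sha1(content).hexdigest()

# Hash every file in parallel across the cluster
hashes = rdd.mapValues(sha1_of_bytes)

# Convert the (path, hash) pairs into a Spark DataFrame for the inventory
df = hashes.map(lambda kv: (os.path.basename(kv[0]), kv[0], kv[1])) \
           .toDF(["File_Name", "File_Path", "SHA1_Hash_Value"])

display(df)
# To persist the inventory, write the DataFrame back to the mount, e.g.:
# df.write.mode("append").csv("dbfs:/mnt/workspace/Inventory/File_Hashes")

Note that binaryFiles loads each file into memory as a single record, which is fine for typical CSV sizes but worth keeping in mind for very large files.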