Databricks spark.readStream format differences
I am confused about the difference between the following two snippets in Databricks:
spark.readStream.format('json')
vs
spark.readStream.format('cloudFiles').option('cloudFiles.format', 'json')
I know that using cloudFiles as the format means Databricks Auto Loader. Which one is better in terms of performance and functionality? Does anyone have experience with this?
Thanks
Solution 1:
There are multiple differences between the two. With Auto Loader you get at least the following (see the documentation for the full details):
- Better performance, scalability, and cost efficiency when discovering new files. You can use either file notification mode (you are notified about new files through a cloud-native integration) or optimized file listing mode, which uses native cloud APIs to list files and directories. Spark's file streaming relies on the Hadoop APIs, which are much slower, especially when you have many nested directories and many files. See the first sketch after this list.
- Support for schema inference and evolution. With Auto Loader you can detect schema changes for JSON/CSV/Avro and adjust processing to pick up new fields; the second sketch after this list shows the relevant options.
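
For illustration, here is roughly how the two sides of the comparison look. This is a minimal sketch, assuming a Databricks notebook where spark is already defined; the paths and the schema are placeholders, not anything from the question:

# Plain Spark file streaming: file discovery goes through directory
# listing via the Hadoop APIs, and a schema must be supplied up front.
from pyspark.sql.types import StructType, StructField, StringType, LongType

input_path = "s3://my-bucket/landing/events/"           # placeholder path
checkpoint_path = "s3://my-bucket/checkpoints/events/"  # placeholder path

schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

plain_stream = (
    spark.readStream
    .format("json")
    .schema(schema)   # streaming file sources require an explicit schema
    .load(input_path)
)

# Auto Loader: same source, but new-file discovery can use cloud-native
# notifications (file notification mode) instead of repeated listings.
auto_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")         # file notification mode
    .option("cloudFiles.schemaLocation", checkpoint_path)  # where the inferred schema is stored
    .load(input_path)
)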
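
And a sketch of the schema inference and evolution options, reusing the placeholder paths from the previous sketch (the target table name is hypothetical; addNewColumns is the documented default evolution mode):

df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Persist the inferred schema between runs so evolution can be tracked.
    .option("cloudFiles.schemaLocation", checkpoint_path)
    # Infer real column types instead of defaulting everything to strings.
    .option("cloudFiles.inferColumnTypes", "true")
    # Default mode: when a new field appears, the stream stops, the stored
    # schema is updated, and the new column is picked up on restart.
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load(input_path)
)

(df.writeStream
   .option("checkpointLocation", checkpoint_path)
   .trigger(availableNow=True)   # process available files, then stop
   .toTable("events_bronze"))    # hypothetical target table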