To get good performance with Spark, I'm wondering whether it is better to run SQL queries via SQLContext or to express the same queries via DataFrame functions like df.select().

Any idea? :)


Solution 1:

There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.

  • Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.

  • Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modification in every supported language. With HiveContext, they can also be used to expose functionality that may be inaccessible in other ways (for example, UDFs without Spark wrappers).
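A minimal sketch (Scala, using SparkSession as the entry point; SQLContext behaves the same way) to illustrate that both styles go through the same engine. The `people` data and column names are invented for illustration; comparing the explain() output of the two queries should show equivalent physical plans.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-vs-dataframe").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data, registered as a temp view so it can be queried with SQL.
val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Same query, two APIs.
val viaSql = spark.sql("SELECT name FROM people WHERE age > 30")
val viaDf  = people.filter($"age" > 30).select($"name")

// Both are optimized by Catalyst; the physical plans should match.
viaSql.explain()
viaDf.explain()
```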

Solution 2:

Ideally, Spark's Catalyst optimizer should compile both calls to the same execution plan, so performance should be identical; which style you use is just a matter of preference. In practice, there can be differences: according to a report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), SQL outperformed DataFrames in a case where records are grouped with their total counts and sorted in descending order by record name.
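For reference, a sketch of that scenario written both ways; the table and column names ("events", "name") are placeholders, not taken from the report.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, desc}

val spark = SparkSession.builder().getOrCreate()

// SQL version: group, count, sort descending by record name.
val viaSql = spark.sql(
  "SELECT name, COUNT(*) AS total FROM events GROUP BY name ORDER BY name DESC")

// Equivalent DataFrame version.
val viaDf = spark.table("events")
  .groupBy(col("name"))
  .agg(count("*").as("total"))
  .orderBy(desc("name"))
```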

Solution 3:

By using DataFrames, you can break a large SQL query into multiple smaller statements/queries, which helps with debugging, incremental enhancements, and code maintenance.

Breaking a complex SQL query into simpler queries and assigning each result to a DataFrame makes the logic easier to understand.

By splitting the query into multiple DataFrames, developers also gain the ability to cache intermediate results and to repartition (to distribute data evenly across partitions using a unique or close-to-unique key), as sketched below.
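A minimal sketch of that pattern; the tables and columns (orders, customers, customer_id, amount, order_date, country) are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder().getOrCreate()

val orders    = spark.table("orders")
val customers = spark.table("customers")

// Step 1: filter early and repartition on a close-to-unique key
// so the data is spread evenly across partitions.
val recentOrders = orders
  .filter(col("order_date") >= "2023-01-01")
  .repartition(col("customer_id"))

// Step 2: cache the intermediate result because it is reused below.
recentOrders.cache()

// Step 3: downstream queries build on the cached DataFrame.
val totalsPerCustomer = recentOrders
  .groupBy("customer_id")
  .agg(sum("amount").as("total_amount"))

val joined = totalsPerCustomer
  .join(customers, Seq("customer_id"))
  .select("customer_id", "country", "total_amount")
```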

Solution 4:

The only thing that matters here is what kind of underlying algorithm is used for grouping; HashAggregation is more efficient than SortAggregation.

  • SortAggregation sorts the rows and then gathers the matching rows together: O(n*log n).

  • HashAggregation builds a hash map with the grouping columns as keys and the remaining columns as values: O(n).

Spark SQL uses HashAggregation where possible (when the aggregated values are of mutable, fixed-size types).
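A hedged sketch of how to check which strategy the planner picks, using explain(); the toy data is made up, and the exact operator shown (HashAggregate, ObjectHashAggregate, or SortAggregate) depends on the aggregate function and the Spark version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, sum}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

// Aggregating a numeric (mutable, fixed-width) buffer: the physical plan
// typically shows HashAggregate.
df.groupBy("key").agg(sum("value")).explain()

// Aggregating into a variable-length buffer (e.g. collect_list) cannot use
// the hash-based fast path, so the plan typically falls back to
// SortAggregate or ObjectHashAggregate, depending on the Spark version.
df.groupBy("key").agg(collect_list("value")).explain()
```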