Why is Apache Spark (Python) so slow locally compared to pandas?
I'm a Spark newbie. I recently started playing around with Spark on my local machine, using two cores, with the command:
pyspark --master local[2]
I have a 393 MB text file with almost a million rows. I wanted to perform some data manipulation, so I am using the built-in DataFrame functions of PySpark for simple operations like groupBy, sum, max, and stddev.
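For concreteness, here is a minimal sketch of the kind of aggregation I mean (the file name and the column names category and value are just placeholders for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local two-core session, as described above.
spark = SparkSession.builder.master("local[2]").appName("agg-demo").getOrCreate()

# Placeholder file and column names; the real schema is different.
df = spark.read.csv("data.txt", header=True, inferSchema=True)

result = (
    df.groupBy("category")
      .agg(
          F.sum("value").alias("total"),
          F.max("value").alias("maximum"),
          F.stddev("value").alias("std_dev"),
      )
)
result.show()
```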
However, when I do the exact same operations in pandas on the exact same dataset, pandas beats PySpark by a huge margin in terms of latency.
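The pandas counterpart of the same aggregation, with the same placeholder names:

```python
import pandas as pd

# Same placeholder file and column names as the PySpark sketch above.
pdf = pd.read_csv("data.txt")

result = pdf.groupby("category")["value"].agg(["sum", "max", "std"])
print(result)
```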
I was wondering what could be a possible reason for this. I have a couple of thoughts:
- Do the built-in functions handle serialization/deserialization inefficiently? If so, what are the alternatives?
- Is the dataset too small to amortize the overhead of the underlying JVM on which Spark runs?
Thanks for looking. Much appreciated.
Solution 1:
Because:
- Apache Spark is a complex framework designed to distribute processing across hundreds of nodes, while ensuring correctness and fault tolerance. Each of these properties has significant cost.
- Because purely in-memory, in-core processing (pandas) is orders of magnitude faster than the disk and network I/O (even on a single local machine) that Spark incurs.
- Because parallelism (and distributed processing) adds significant overhead, and even an optimal, embarrassingly parallel workload does not guarantee any performance improvement.
- Because local mode is not designed for performance. It is used for testing.
- Last but not least, 2 cores working on 393 MB is not enough to see any performance improvement, and a single node does not provide any opportunity for distribution (see the timing sketch below).
- See also: Spark: Inconsistent performance number in scaling number of cores, Why is pyspark so much slower in finding the max of a column?, and Why does my Spark run slower than pure Python? Performance comparison.
You can go on like this for a long time...
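If you want to see the overhead directly, a rough timing sketch along the lines of the question's workload looks like this (the file path and column names are assumptions; the absolute numbers will depend on your machine, but the gap on a dataset this small is expected):

```python
import time

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

# PySpark: includes scheduling, Python<->JVM serialization, and shuffle overhead.
start = time.time()
sdf = spark.read.csv("data.txt", header=True, inferSchema=True)
sdf.groupBy("category").agg(
    F.sum("value"), F.max("value"), F.stddev("value")
).collect()  # force execution; Spark is lazy until an action is called
print(f"PySpark local[2]: {time.time() - start:.2f}s")

# pandas: single process, purely in-memory.
start = time.time()
pdf = pd.read_csv("data.txt")
pdf.groupby("category")["value"].agg(["sum", "max", "std"])
print(f"pandas: {time.time() - start:.2f}s")
```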