New posts in apache-spark

What is the maximum size for a broadcast object in Spark?

apache-spark dataframe apache-spark-sql broadcast

Spark gives a StackOverflowError when training using ALS

apache-spark pyspark

Temp table caching with spark-sql

apache-spark apache-spark-sql

What is the difference between spark-submit and pyspark?

python apache-spark pyspark

Filtering DataFrame using the length of a column

python apache-spark dataframe pyspark apache-spark-sql

PySpark first and last function over a partition in one go

apache-spark pyspark apache-spark-sql pyspark-dataframes

DataFrame: how to groupBy/count then filter on count in Scala

scala apache-spark apache-spark-sql

PySpark slice dataset adding a column until a condition

apache-spark pyspark apache-spark-sql window

Is it better to have one large parquet file or lots of smaller parquet files?

hadoop apache-spark parquet

How to create a sequence of timestamps in Scala

scala apache-spark date apache-spark-sql timestamp

Wrong sequence of months in PySpark sequence interval month

apache-spark pyspark apache-spark-sql

What is the difference between cube, rollup and groupBy operators?

sql apache-spark apache-spark-sql cube rollup

PySpark: match the values of a DataFrame column against another DataFrame column

python apache-spark pyspark

Why does a Spark RDD partition have a 2GB limit for HDFS?

scala apache-spark rdd

Exploding nested Struct in Spark dataframe

scala apache-spark apache-spark-sql distributed-computing databricks

How to get keys and values from MapType column in SparkSQL DataFrame

scala apache-spark dataframe apache-spark-sql apache-spark-dataset

Mind blown: RDD.zip() method

What will Spark do if I don't have enough memory?

How to calculate the best numberOfPartitions for coalesce?

scala apache-spark rdd

How to find PySpark DataFrame memory usage?

python apache-spark dataframe pyspark