New posts in apache-spark

Spark using Python: How to resolve "Stage x contains a task of very large size (xxx KB). The maximum recommended task size is 100 KB"

apache-spark spark-streaming

How do I check for equality using a Spark DataFrame without a SQL query?

scala apache-spark dataframe apache-spark-sql

Get the data type of a column using PySpark

apache-spark pyspark apache-spark-sql

How to bootstrap installation of Python modules on Amazon EMR?

python amazon-web-services apache-spark emr

Reading CSV files with quoted fields containing embedded commas

csv apache-spark pyspark apache-spark-sql apache-spark-2.0

How to write the resulting RDD to a CSV file in Spark (Python)

python csv apache-spark pyspark file-writing

How to prevent Spark Executors from getting Lost when using YARN client mode?

apache-spark hadoop-yarn

How to run ETL pipeline on Databricks (Python)

python apache-spark spark-streaming databricks amazon-kinesis

What's the difference between join and cogroup in Apache Spark?

scala apache-spark

What's the difference between the Spark ML and MLlib packages?

apache-spark apache-spark-mllib apache-spark-ml

Reading JSON with Apache Spark - `corrupt_record`

json scala apache-spark

How to convert Row of a Scala DataFrame into case class most efficiently?

scala apache-spark apache-spark-sql

Apply StringIndexer to several columns in a PySpark DataFrame

python apache-spark pyspark

Where are logs in Spark on YARN?

hadoop logging apache-spark cloudera hadoop-yarn

Convert a Spark DataFrame to a pandas DataFrame

pandas apache-spark apache-spark-sql

Why do spark-submit and spark-shell fail with "Failed to find Spark assembly JAR. You need to build Spark before running this program."?

How do I iterate over RDDs in Apache Spark (Scala)?

scala apache-spark

Overwrite only some partitions in a partitioned spark Dataset

apache-spark hive apache-spark-dataset

Reading DataFrame from partitioned parquet file

scala apache-spark parquet spark-dataframe

What is yarn-client mode in Spark?

hadoop-yarn apache-spark