New posts in apache-spark

Is there a reason not to use SparkContext.getOrCreate when writing a Spark job?

scala apache-spark cassandra datastax

Spark DataFrames when UDFs do not accept large enough input variables

scala apache-spark dataframe apache-spark-sql apache-spark-mllib

Amazon s3a returns 400 Bad Request with Spark

amazon-web-services amazon-s3 apache-spark hdfs spark-streaming

Saving dataframe to local file system results in empty results

apache-spark amazon-emr

Spark load data and add filename as dataframe column

apache-spark pyspark apache-spark-sql

Dealing with a large gzipped file in Spark

apache-spark gzip amazon-emr

Encode an ADT / sealed trait hierarchy into Spark DataSet column

scala apache-spark apache-spark-dataset apache-spark-encoders

Joining many large files on AWS

amazon-web-services apache-spark bigdata

Why does Spark fail with java.lang.OutOfMemoryError: GC overhead limit exceeded?

scala apache-spark

Operate on neighbor elements in RDD in Spark

scala apache-spark

How to control preferred locations of RDD partitions?

apache-spark pyspark rdd

What is the difference between Apache Mahout and Apache Spark's MLlib?

apache-spark mahout apache-spark-mllib

Why does sortBy transformation trigger a Spark job?

apache-spark rdd partitioning partitioner

'PipelinedRDD' object has no attribute 'toDF' in PySpark

python apache-spark pyspark apache-spark-sql rdd

Why does partition parameter of SparkContext.textFile not take effect?

scala apache-spark rdd

Convert date from String to Date format in Dataframes

apache-spark apache-spark-sql

PySpark in IPython notebook raises Py4JJavaError when using count() and first()

python apache-spark pyspark virtualenv ipython-notebook

How to group by common element in array?

apache-spark apache-spark-sql

Spark: Transpose DataFrame Without Aggregating

scala apache-spark

ALS model - how to generate full_u * v^t * v?

apache-spark apache-spark-mllib apache-spark-ml