Why doesn't my simple Spark code print anything?

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    val rdd1: RDD[Int] = sc.makeRDD(List(2, 4, 6, 8), 2)

    // just print each partition's index and data, then return the data unchanged
    val rdd2: RDD[Int] = rdd1.mapPartitionsWithIndex((par, datas) => {
      println("data and partition info : par = " + par + " datas = " + datas.mkString(" "))
      datas // return datas again
    })

    // I think rdd2 contains four elements: 2, 4, 6, 8
    // so I collect rdd2 and print each element, but nothing is output. Why does this happen?
    rdd2.collect().foreach(println)

    sc.stop()
  }

I am studying Spark and wrote a simple demo, but there is one thing I do not understand: why does rdd2.collect().foreach(println) not print anything?


Solution 1:

Your problem is that the Iterator you return from the mapPartitionsWithIndex function has already been traversed by the mkString call. Iterators are special collections that let Spark process large partitions one element at a time, and they can only be traversed once. They show up throughout the RDD API, in functions such as foreachPartition, mapPartitions, zipPartitions, and so on. Take a look at how they work, and pay attention to this statement: "one should never use an iterator after calling a method on it.". Because mkString exhausts datas, every partition of rdd2 comes back empty and collect() returns an empty array. Drop the println line and it should work.
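
If you want to keep the log line, one option is to copy the partition into a List first and return a fresh iterator over it. The sketch below is plain Scala with no Spark dependency, so it can be run on its own; IteratorDemo and logAndKeep are names made up for this example, and it assumes each partition is small enough to hold in memory.

```scala
object IteratorDemo {
  // Same shape as the function passed to mapPartitionsWithIndex:
  // (partition index, partition data) => new partition data.
  // Hypothetical helper, not part of the Spark API.
  def logAndKeep(par: Int, datas: Iterator[Int]): Iterator[Int] = {
    // Materialize the single-pass iterator so it can be read more than once.
    // Assumption: the partition fits in memory.
    val buffered = datas.toList
    println("data and partition info : par = " + par + " datas = " + buffered.mkString(" "))
    buffered.iterator // hand back a fresh iterator over the same elements
  }

  def main(args: Array[String]): Unit = {
    // The problem in isolation: mkString drains the iterator.
    val it = Iterator(2, 4)
    val text = it.mkString(" ")
    println(text)       // 2 4
    println(it.hasNext) // false

    // The workaround: the elements survive the logging step.
    println(logAndKeep(0, Iterator(2, 4)).toList) // List(2, 4)
  }
}
```

The body of logAndKeep can be used as the body of the lambda passed to mapPartitionsWithIndex in the question. For large partitions this copy defeats the purpose of streaming through an iterator, so removing the println is still the simpler fix.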