get specific row from spark dataframe

Firstly, you must understand that DataFrames are distributed, that means you can't access them in a typical procedural way, you must do an analysis first. Although, you are asking about Scala I suggest you to read the Pyspark Documentation, because it has more examples than any of the other documentations.

However, continuing with my explanation, I would use some methods of the RDD API cause all DataFrames have one RDD as attribute. Please, see my example bellow, and notice how I take the 2nd record.

df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
myIndex = 1
values = (df.rdd.zipWithIndex()
            .filter(lambda ((l, v), i): i == myIndex)
            .map(lambda ((l,v), i): (l, v))
            .collect())

print(values[0])
# (u'b', 2)

Hopefully, someone gives another solution with fewer steps.

This is how I achieved the same in Scala. I am not sure if it is more efficient than the valid answer, but it requires less coding

val parquetFileDF = sqlContext.read.parquet("myParquetFule.parquet")

val myRow7th = parquetFileDF.rdd.take(7).last

In PySpark, if your dataset is small (can fit into memory of driver), you can do

df.collect()[n]

where df is the DataFrame object, and n is the Row of interest. After getting said Row, you can do row.myColumn or row["myColumn"] to get the contents, as spelled out in the API docs.

get specific row from spark dataframe

Related

Recent Posts