get specific row from spark dataframe
Firstly, you must understand that DataFrames
are distributed, that means you can't access them in a typical procedural way, you must do an analysis first. Although, you are asking about Scala
I suggest you to read the Pyspark Documentation, because it has more examples than any of the other documentations.
However, continuing with my explanation, I would use some methods of the RDD
API cause all DataFrame
s have one RDD
as attribute. Please, see my example bellow, and notice how I take the 2nd record.
df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
myIndex = 1
values = (df.rdd.zipWithIndex()
.filter(lambda ((l, v), i): i == myIndex)
.map(lambda ((l,v), i): (l, v))
.collect())
print(values[0])
# (u'b', 2)
Hopefully, someone gives another solution with fewer steps.
This is how I achieved the same in Scala. I am not sure if it is more efficient than the valid answer, but it requires less coding
val parquetFileDF = sqlContext.read.parquet("myParquetFule.parquet")
val myRow7th = parquetFileDF.rdd.take(7).last
In PySpark, if your dataset is small (can fit into memory of driver), you can do
df.collect()[n]
where df
is the DataFrame object, and n
is the Row of interest. After getting said Row, you can do row.myColumn
or row["myColumn"]
to get the contents, as spelled out in the API docs.