PySpark: create an RDD with the line number and a list of the words in each line
I'm working with a plain text file and am trying to create an RDD that consists of the line number and a list of the words contained in the line.
I create the RDD as:
RDD = sc.textFile('article.txt')
Then I do a zipWithIndex and a map to get the line number and the text:
RDD2 = RDD.zipWithIndex().map(lambda x: (x[1], x[0]))
for element in RDD2.take(2):
    print(element)
Which results in:
(0, 'This is the 100th Etext file presented by Project Gutenberg, and')
(1, 'is presented in cooperation with World Library, Inc., from their')
How do I convert the text of each line into a list of words? I'd appreciate any suggestions.
Solution 1:
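You can call split() on the line text inside the map. With no arguments it splits on any run of whitespace:
# x is (line_text, index); emit (index, list_of_words)
RDD2 = RDD.zipWithIndex().map(lambda x: (x[1], x[0].split()))
for element in RDD2.take(2):
    print(element)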
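To see the shape of the result without a Spark session, here is a minimal plain-Python sketch of the same idea. It is only an analogy, not Spark code: enumerate stands in for zipWithIndex, and the two sample lines are the ones shown in the question.

```python
# Plain-Python analogue of zipWithIndex().map(... split()); no Spark needed.
lines = [
    'This is the 100th Etext file presented by Project Gutenberg, and',
    'is presented in cooperation with World Library, Inc., from their',
]

# enumerate plays the role of zipWithIndex (index first, like the swapped tuple).
indexed_words = [(i, line.split()) for i, line in enumerate(lines)]

for element in indexed_words:
    print(element)
```

Each element is a (line number, list of words) tuple, which is the structure the RDD version produces per line.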
Solution 2:
If you want to do it with a DataFrame instead of an RDD, starting from the (line number, text) RDD2 in the question:
from pyspark.sql import functions as sf
df = spark.createDataFrame(RDD2, schema="row_num: int, line: string")  # convert to DataFrame
df2 = df.withColumn("words", sf.split(df.line, r"\s+")).drop("line")  # split on whitespace (raw string for the regex) and drop the original line
df2.show(10)
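One difference worth checking: a regex split on \s+ is not the same as Python's argument-less split() when a line starts or ends with whitespace. The sketch below shows this with Python's re module only. I'd expect Spark's regex-based split to behave similarly, but that is an assumption to verify on your own data (applying sf.trim to the line first is one way to sidestep it). The sample string is made up for illustration.

```python
import re

# Hypothetical sample line with leading and trailing whitespace.
line = "  leading and trailing spaces  "

words_plain = line.split()            # runs of whitespace collapsed, no empty strings
words_regex = re.split(r"\s+", line)  # regex split keeps empty strings at the edges

print(words_plain)
print(words_regex)
```

If empty strings do show up in the words column, trimming the line before splitting is usually the simplest fix.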