PySpark: create an RDD with the line number and a list of the words in each line

I'm working with a plain text file and am trying to create an RDD that consists of the line number and a list of the words contained in the line.

I create the RDD as:

RDD = sc.textFile('article.txt')

Then I do a zipWithIndex and a map to get the line number and the text:

RDD2=RDD.zipWithIndex().map(lambda x: (x[1], x[0]))
for element in RDD2.take(2):
    print(element)

Which results in:

(0, 'This is the 100th Etext file presented by Project Gutenberg, and')
(1, 'is presented in cooperation with World Library, Inc., from their')

How do I proceed to convert the text to a list? I'd appreciate any suggestions.


Solution 1:

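Call split() on the line text inside the map. With no arguments, str.split() splits on runs of whitespace and returns a list of words: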

RDD2=RDD.zipWithIndex().map(lambda x: (x[1], x[0].split()))
for element in RDD2.take(2):
    print(element)
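
If you want to check what that lambda produces without starting Spark, here is a minimal plain-Python sketch. It only mimics the pipeline: the `lines` list stands in for the text file (using the two sample lines from the question), and `zipped` and `to_pair` are names made up for this illustration.

```python
# Plain-Python stand-in for the RDD pipeline (no Spark needed).
# `lines` is a hypothetical substitute for sc.textFile('article.txt').
lines = [
    "This is the 100th Etext file presented by Project Gutenberg, and",
    "is presented in cooperation with World Library, Inc., from their",
]

# zipWithIndex() yields (element, index) pairs, so build the same shape here.
zipped = [(line, idx) for idx, line in enumerate(lines)]

# Same lambda as in the map: put the index first and split the text into words.
to_pair = lambda x: (x[1], x[0].split())
result = [to_pair(x) for x in zipped]

for element in result:
    print(element)
```

Each element is a (line number, list of words) tuple. Note that punctuation stays attached to the words (for example 'Gutenberg,'), since split() only looks at whitespace.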

Solution 2:

If you want to do it with a DataFrame instead of an RDD (here RDD2 is the (line number, text) RDD from the question, and spark is your SparkSession):

from pyspark.sql import functions as sf
df = spark.createDataFrame(RDD2, schema="row_num: int, line: string") # convert to DataFrame
df2 = df.withColumn("words", sf.split(df.line, r"\s+")).drop("line") # split on whitespace (raw string avoids an invalid-escape warning) and drop the original line
df2.show(10)
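
One thing to watch for with a regex split on whitespace: it does not trim the line first the way str.split() with no arguments does. The sketch below shows the difference using Python's own re.split, not Spark. I am assuming Spark's regex-based split treats leading and trailing whitespace similarly, so check it on your own data. The sample `line` is made up for the illustration.

```python
import re

# Hypothetical line with leading and trailing whitespace.
line = "  is presented in cooperation  "

# str.split() with no arguments trims the ends and collapses whitespace runs.
plain = line.split()

# A regex split on \s+ keeps an empty string at each padded end.
by_regex = re.split(r"\s+", line)

# Dropping the empty strings makes both results agree.
cleaned = [w for w in by_regex if w]

print(plain)
print(by_regex)
print(cleaned)
```

If empty strings do show up in the words column, trimming the line before splitting (or filtering out empty entries afterwards) should take care of it.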