How to delete columns in pyspark dataframe
>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
>>> a.join(b, a.id==b.id, 'outer')
DataFrame[id: bigint, julian_date: string, user_id: bigint, id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
There are two id: bigint
and I want to delete one. How can I do?
Solution 1:
Reading the Spark documentation I found an easier solution.
Since version 1.4 of spark there is a function drop(col)
which can be used in pyspark on a dataframe.
You can use it in two ways
df.drop('age')
df.drop(df.age)
Pyspark Documentation - Drop
Solution 2:
Adding to @Patrick's answer, you can use the following to drop multiple columns
columns_to_drop = ['id', 'id_copy']
df = df.drop(*columns_to_drop)