Aggregation of a data frame based on a condition (PySpark)
I need help with the following task:
I have obtained a groupBy result with a list of multiple addresses (here just a cut-out with one address) and the people occupying those addresses. For each person I need to calculate a ratio of app usage by dividing the [name] active count by the [name] passive count, and create a new dataframe with [address][name][usage_ratio].
I have never done an aggregation like this and I have no idea where to start or how to loop over it. Can anyone help?
+------------+--------------------+----------------+-----+
| address| name| use_of_app|count|
+------------+--------------------+----------------+-----+
| 33| Mark| active| 35|
| 33| Mark| passive| 4|
| 33| Abbie| active| 30|
| 33| Abbie| passive| 2|
| 33| Anna| passive| 3|
| 33| Anna| active| 32|
| 33| Tom| passive| 38|
| 33| Tom| active| 50|
| 33| Patrick| passive| 40|
| 33| Patrick| active| 57|
+------------+--------------------+----------------+-----+
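For reference, a minimal sketch that rebuilds the sample above (it assumes a SparkSession is obtained via getOrCreate, and the address is stored as a string here to match the display; adjust if yours is numeric):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rebuild the sample: one address, five names, one "active" and one "passive" row per name
df = spark.createDataFrame(
    [
        ("33", "Mark", "active", 35),
        ("33", "Mark", "passive", 4),
        ("33", "Abbie", "active", 30),
        ("33", "Abbie", "passive", 2),
        ("33", "Anna", "passive", 3),
        ("33", "Anna", "active", 32),
        ("33", "Tom", "passive", 38),
        ("33", "Tom", "active", 50),
        ("33", "Patrick", "passive", 40),
        ("33", "Patrick", "active", 57),
    ],
    ["address", "name", "use_of_app", "count"],
)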
Another option would be a pivot step:
from pyspark.sql import functions as F

(
    df.groupby("address", "name")
    .pivot("use_of_app", values=["active", "passive"])
    .agg(F.sum("count"))
    .withColumn("ratio", F.col("active") / F.col("passive"))
    .show()
)
# Output
+-------+-----+------+-------+------------------+
|address| name|active|passive| ratio|
+-------+-----+------+-------+------------------+
| 33|Abbie| 30| 2| 15.0|
| 33| Anna| 32| 3|10.666666666666666|
| 33| Mark| 35| 4| 8.75|
+-------+-----+------+-------+------------------+
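One thing to watch with the pivot approach: if a name has rows for only one of the two values, the other pivoted column comes out null and so does the ratio. A minimal null-safe variant (a sketch, assuming you want a missing category treated as 0 and the ratio left null when there is nothing to divide by):
from pyspark.sql import functions as F

result = (
    df.groupby("address", "name")
    .pivot("use_of_app", values=["active", "passive"])
    .agg(F.sum("count"))
    # treat a missing category as 0 instead of null
    .fillna(0, subset=["active", "passive"])
    # only divide when there is a non-zero passive count; otherwise keep null
    .withColumn("ratio", F.when(F.col("passive") > 0, F.col("active") / F.col("passive")))
)
result.show()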
Edit: updated according to the suggestion by Steven to pass the pivot values explicitly: .pivot("use_of_app", values=["active", "passive"]).
Here is my code - I use a sum on the count column because I am not sure how many rows of each use_of_app you will have:
from pyspark.sql import functions as F

df = df.groupBy("address", "name").agg(
    (
        # sum of "count" over the active rows divided by the sum over the passive rows
        F.sum(F.when(F.col("use_of_app") == "active", F.col("count")))
        / F.sum(F.when(F.col("use_of_app") == "passive", F.col("count")))
    ).alias("usage_ratio")
)
df.show()
+-------+-------+------------------+
|address| name| usage_ratio|
+-------+-------+------------------+
| 33| Abbie| 15.0|
| 33| Mark| 8.75|
| 33| Tom|1.3157894736842106|
| 33| Anna|10.666666666666666|
| 33|Patrick| 1.425|
+-------+-------+------------------+
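If you also want to keep the raw active and passive totals next to the ratio, the same conditional-sum pattern produces them as extra columns. A sketch starting again from the original (pre-aggregation) dataframe; the active_count / passive_count names are just illustrative:
from pyspark.sql import functions as F

totals = df.groupBy("address", "name").agg(
    # per-person totals for each use_of_app value
    F.sum(F.when(F.col("use_of_app") == "active", F.col("count"))).alias("active_count"),
    F.sum(F.when(F.col("use_of_app") == "passive", F.col("count"))).alias("passive_count"),
)
totals = totals.withColumn("usage_ratio", F.col("active_count") / F.col("passive_count"))
totals.show()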