Performance-wise, which is better: DataFrame map vs. expressions in Polars?
I am new to Polars. I want to create a new column based on multiple columns. I can see that expressions are powerful, but for complex logic they are quite difficult to read with case and when.
So I tried the map available in LazyFrame, and it looks like it serves the purpose. However, I am not sure whether there will be a performance penalty. Or is there a simpler method that I don't know of?
Below is my code with map:
let df = lf
    .map(
        |df: DataFrame| {
            let a = &df["a"];
            let b = &df["b"];
            let mut r: Series = a
                .f32()?
                .into_iter()
                .zip(b.f32()?.into_iter())
                // A null in either input column yields a null in the result.
                .map(|(a, b)| -> Option<f32> {
                    let (a, b) = (a?, b?);
                    Some(if a * b == 10.0 {
                        10.0
                    } else if a * b == 20.0 {
                        a.cos()
                    } else {
                        b.cos()
                    })
                })
                .collect();
            // Name the new column so it can be referenced later.
            r.rename("r");
            let df_new = DataFrame::new(vec![df["c"].clone(), r])?;
            Ok(df_new)
        },
        None,
        None,
    )
    .select(&[
        a.clone().max().alias("max"),
        b.clone().min().alias("min"),
        r.clone().mean().cast(DataType::Float32).alias("mean"),
    ])
    .collect()?;
Compared to the expression below:
let r = when((a * b).eq(lit::<f32>(10.0)))
    .then(lit::<f32>(10.0))
    .when((a * b).eq(lit::<f32>(20.0)))
    .then(cos(a))
    .otherwise(cos(b));
When you map a custom function over a DataFrame, you are saying "trust me, optimizer, I know what I am doing." The function is a black box to the optimizer, so we are not able to do any optimizations anymore.
Besides that, expressions are often executed in parallel. In the when -> then -> otherwise expression you wrote, all branches are evaluated in parallel:
when((a * b).eq(lit::<f32>(10.0)))
    .then(lit::<f32>(10.0))
    .when((a * b).eq(lit::<f32>(20.0)))
    .then(cos(a))
    .otherwise(cos(b));
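To make the difference concrete, here is a minimal sketch in plain Rust (no Polars involved) of the two evaluation styles. It is only an illustration of the idea, not how Polars is implemented: the function names `rowwise` and `columnwise` are made up for this example, the inputs are assumed to have equal length and no nulls, and the branches here run one after another rather than in parallel.

```rust
/// Row-wise style, like the closure passed to `map`:
/// each row takes exactly one branch.
fn rowwise(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter()
        .zip(b)
        .map(|(&a, &b)| {
            if a * b == 10.0 {
                10.0
            } else if a * b == 20.0 {
                a.cos()
            } else {
                b.cos()
            }
        })
        .collect()
}

/// Column-wise style, roughly what a when/then/otherwise chain does:
/// every branch is computed for the whole column, then boolean masks
/// pick which branch's value ends up in each row.
fn columnwise(a: &[f32], b: &[f32]) -> Vec<f32> {
    let n = a.len().min(b.len());
    let prod: Vec<f32> = a.iter().zip(b).map(|(x, y)| x * y).collect();
    let mask10: Vec<bool> = prod.iter().map(|p| *p == 10.0).collect();
    let mask20: Vec<bool> = prod.iter().map(|p| *p == 20.0).collect();
    // Both cosine branches are evaluated for all rows, even unused ones.
    let cos_a: Vec<f32> = a.iter().map(|x| x.cos()).collect();
    let cos_b: Vec<f32> = b.iter().map(|x| x.cos()).collect();
    (0..n)
        .map(|i| {
            if mask10[i] {
                10.0
            } else if mask20[i] {
                cos_a[i]
            } else {
                cos_b[i]
            }
        })
        .collect()
}
```

Both functions return the same values. The column-wise version does more total work, since it computes branches it later discards, but each step is a simple pass over a whole column, which is the kind of work that can be vectorized and run on several threads.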
Whether it is faster depends on the use case. I'd say benchmark.
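For a quick comparison, a small timing helper built on the standard library may be enough. This is a rough sketch, not a rigorous benchmark: `best_of` is a hypothetical helper name, and `lazy_query` in the comment stands for whichever LazyFrame pipeline you want to measure.

```rust
use std::time::{Duration, Instant};

/// Runs `f` `runs` times (at least once) and returns the fastest
/// wall-clock time together with the result of the last run.
fn best_of<T>(runs: usize, mut f: impl FnMut() -> T) -> (Duration, T) {
    let mut best = Duration::MAX;
    let mut last = None;
    for _ in 0..runs.max(1) {
        let start = Instant::now();
        let out = f();
        best = best.min(start.elapsed());
        last = Some(out);
    }
    (best, last.expect("at least one run"))
}

// Hypothetical usage, once per variant (map vs. expression):
// let (elapsed, result) = best_of(10, || lazy_query.clone().collect());
// println!("fastest run: {:?}", elapsed);
```

Taking the fastest of several runs reduces noise from warm-up and other processes. Make sure to time both variants on the same data, and on data of a realistic size.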
However, you will get used to thinking in expressions, and then the expression syntax turns out to be much more concise.