How to efficiently get the mean of the elements in two lists of lists in Python
Solution 1:
You can do it in O(n) (a single pass over each list) by converting the first list to a dict, then looking up each item of the second list in that dict (an O(1) operation), like this:
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
l1_as_dict = dict(mylist1)
myoutput = []
for item, price2 in mylist2:
    if item in l1_as_dict:
        price1 = l1_as_dict[item]
        myoutput.append([item, (price1 + price2) / 2])
print(myoutput)
Output:
[['chocolate', 0.5], ['egg', 0.45]]
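If you need this in more than one place, the loop above can be wrapped in a small function. This is only a sketch: the name mean_of_common is made up for illustration, and it assumes each item name appears at most once per list (dict() keeps the last value when a name repeats).

```python
def mean_of_common(list1, list2):
    """Average the values of items that appear in both lists.

    Hypothetical helper: results follow the order of list2.
    """
    lookup = dict(list1)  # name -> value; a repeated name keeps its last value
    return [
        [name, (lookup[name] + value) / 2]
        for name, value in list2
        if name in lookup  # O(1) membership test on the dict
    ]


mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

common = mean_of_common(mylist1, mylist2)
print(common)
```

Building the dict is one pass over the first list and the comprehension is one pass over the second, so the total work stays linear in the combined length.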
Solution 2:
An O(n) solution that will average all items.
Construct a dictionary that maps each item to a list of its values, then average each list afterwards:
In []:
d = {}
for lst in (mylist1, mylist2):
    for i, v in lst:
        d.setdefault(i, []).append(v)  # alternatively, use collections.defaultdict(list)
[(k, sum(v)/len(v)) for k, v in d.items()]
Out[]:
[('lemon', 0.1), ('egg', 0.45), ('muffin', 0.3), ('chocolate', 0.5), ('milk', 0.2), ('carrot', 0.8)]
Then if you just want the common ones you can add a guard:
In []:
[(k, sum(v)/len(v)) for k, v in d.items() if len(v) > 1]
Out[]:
[('egg', 0.45), ('chocolate', 0.5)]
This extends to any number of lists and makes no assumption around the number of common elements.
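One thing to watch with more than two lists: the len(v) > 1 guard keeps any item that has at least two values, which is not the same as "present in every list" (and a name repeated inside a single list would also pass). If you want only the items found in all of the lists, a rough sketch is to also record which lists each name came from. The function name and mylist3 below are made up for illustration.

```python
from collections import defaultdict


def mean_of_items_in_all(*lists):
    """Average values per name, keeping only names found in every list."""
    values = defaultdict(list)  # name -> every value seen
    sources = defaultdict(set)  # name -> indices of the lists containing it
    for index, lst in enumerate(lists):
        for name, value in lst:
            values[name].append(value)
            sources[name].add(index)
    return [
        (name, sum(vals) / len(vals))
        for name, vals in values.items()
        if len(sources[name]) == len(lists)  # seen in all input lists
    ]


mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
mylist3 = [["egg", 0.6], ["milk", 0.4]]  # hypothetical third list

in_all = mean_of_items_in_all(mylist1, mylist2, mylist3)
in_both = mean_of_items_in_all(mylist1, mylist2)
print(in_all)
print(in_both)
```

With the three lists above only egg appears everywhere; with just the first two lists the result is the same egg and chocolate pair as before. It is still a single pass over every element.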
Solution 3:
Here is one solution that uses collections.defaultdict to group the items and calculates the averages with statistics.mean:
from collections import defaultdict
from statistics import mean
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
d = defaultdict(list)
for lst in (mylist1, mylist2):
    for k, v in lst:
        d[k].append(v)
result = [[k, mean(v)] for k, v in d.items()]
print(result)
# [['lemon', 0.1], ['egg', 0.45], ['muffin', 0.3], ['chocolate', 0.5], ['milk', 0.2], ['carrot', 0.8]]
If we only want the common keys, just check whether a key has collected more than one value:
result = [[k, mean(v)] for k, v in d.items() if len(v) > 1]
print(result)
# [['egg', 0.45], ['chocolate', 0.5]]
We could also just build the result from set intersection:
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]
d1, d2 = dict(mylist1), dict(mylist2)
result = [[k, (d1[k] + d2[k]) / 2] for k in d1.keys() & d2.keys()]
print(result)
# [['egg', 0.45], ['chocolate', 0.5]]  (sets are unordered, so the order may vary)
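Because a set intersection does not guarantee any particular order, the result of the snippet above can come out in a different order from run to run. If the order matters, one option (a sketch, not the only way) is to iterate over the first dict, which preserves insertion order in Python 3.7+, and test membership in the second:

```python
mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

d1, d2 = dict(mylist1), dict(mylist2)

# Walk d1 in insertion order and keep only the keys that d2 also has.
ordered_result = [[k, (d1[k] + d2[k]) / 2] for k in d1 if k in d2]
print(ordered_result)
```

The common items then always appear in the order they have in mylist1, and the lookup in d2 is still O(1) per key.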
Solution 4:
You can use the pandas library to avoid writing any loops yourself.
The code ends up concise and clean.
Install pandas with: pip install pandas
Then try this:
In [132]: import pandas as pd
In [109]: df1 = pd.DataFrame(mylist1)
In [110]: df2 = pd.DataFrame(mylist2)
In [117]: res = pd.merge(df1, df2, on=0)
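In [121]: res['mean'] = res[['1_x', '1_y']].mean(axis=1)  # average only the two numeric columns
In [125]: res.drop(columns=['1_x', '1_y'], inplace=True)  # positional axis argument was removed in pandas 2.0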
In [131]: res.values.tolist()
Out[131]: [['egg', 0.45], ['chocolate', 0.5]]
Edit
Pandas is very fast on large data because it uses NumPy under the hood, and NumPy implements highly efficient array operations.
Please check the post "Why is Pandas so madly fast?" for more details on calculating the mean in pure Python vs. pandas.