How to efficiently get the mean of the elements in two list of lists in Python

Solution 1:

You can do it in O(n) (single pass over each list) by converting 1 to a dict, then per item in the 2nd list access that dict (in O(1)), like this:

mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

l1_as_dict = dict(mylist1)

myoutput = []
for item,price2 in mylist2:
    if item in l1_as_dict:
        price1 = l1_as_dict[item]
        myoutput.append([item, (price1+price2)/2])

print(myoutput)

Output:

[['chocolate', 0.5], ['egg', 0.45]]

Solution 2:

An O(n) solution that will average all items.
Construct a dictionary with a list of the values and then average that dictionary afterwards:

In []:
d = {}
for lst in (mylist1, mylist2):
    for i, v in lst:
        d.setdefault(i, []).append(v)   # alternative use collections.defaultdict

[(k, sum(v)/len(v)) for k, v in d.items()]

Out[]:
[('lemon', 0.1), ('egg', 0.45), ('muffin', 0.3), ('chocolate', 0.5), ('milk', 0.2), ('carrot', 0.8)]

Then if you just want the common ones you can add a guard:

In []:
[(k, sum(v)/len(v)) for k, v in d.items() if len(v) > 1]

Out[]:
[('egg', 0.45), ('chocolate', 0.5)]

This extends to any number of lists and makes no assumption around the number of common elements.

Solution 3:

Here is one solution that uses collections.defaultdict to group the items and calculates the averages with statistics.mean:

from collections import defaultdict
from statistics import mean

mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

d = defaultdict(list)
for lst in (mylist1, mylist2):
    for k, v in lst:
        d[k].append(v)

result = [[k, mean(v)] for k, v in d.items()]

print(result)
# [['lemon', 0.1], ['egg', 0.45], ['muffin', 0.3], ['chocolate', 0.5], ['milk', 0.2], ['carrot', 0.8]]

If we only want common keys, just check if the values are more than 1:

result = [[k, mean(v)] for k, v in d.items() if len(v) > 1]

print(result)
# [['egg', 0.45], ['chocolate', 0.5]]

We could also just build the result from set intersection:

mylist1 = [["lemon", 0.1], ["egg", 0.1], ["muffin", 0.3], ["chocolate", 0.5]]
mylist2 = [["chocolate", 0.5], ["milk", 0.2], ["carrot", 0.8], ["egg", 0.8]]

d1, d2 = dict(mylist1), dict(mylist2)

result = [[k, (d1[k] + d2[k]) / 2] for k in d1.keys() & d2.keys()]

print(result)
# [['egg', 0.45], ['chocolate', 0.5]]

Solution 4:

You can use the Pandas library to avoid writing any sort of loops yourself.

Your code would be really concise and clean.

Install Pandas like: pip install pandas.

Then try this:

In [132]: import pandas as pd

In [109]: df1 = pd.DataFrame(mylist1)

In [110]: df2 = pd.DataFrame(mylist2)

In [117]: res = pd.merge(df1, df2, on=0)

In [121]: res['mean'] = res.mean(axis=1)

In [125]: res.drop(['1_x', '1_y'], 1, inplace=True)

In [131]: res.values.tolist()
Out[131]: [['egg', 0.45], ['chocolate', 0.5]]

Edit

Pandas is crazy fast because it uses numpy under the hood. Numpy implements highly efficient array operations.

Please check the post : Why is Pandas so madly fast? for more details on calculating mean through pure Python vs Pandas.