Let's say we have two tables, trans and product. Hypothetically the trans table consists of over a billion rows of purchases bought by users.

I am trying to find paired products that are often purchased together(purchased on the same date) by the same user, such as wine and bottle openers, chips and beer, etc..

I am trying to find the top five paired products and their names.

trans and prod dataframe :-

trans = {'ID':[1,1,2,2,3,3,1,5,5,6,6,6],
        'productID':[11,22,11,22,33,77,11,77,88,11,22,77],
        'Year':['2022-01-01','2022-01-01','2020-01-05','2020-01-05','2019-01-01','2019-01-01','2020-01-07','2020-01-08',
                '2020-01-08','2021-06-01','2021-06-01','2021-06-01']}
trans = pd.DataFrame(trans)
trans['Year'] = pd.to_datetime(trans['Year'])
trans


product = {'productID':[11,22,33,44,55,77,88],
        'prodname':['phone','Charger','eaphones','headset','scratchgaurd','pin','cover']}
product = pd.DataFrame(product)
product

My code till now where was trying to Rank the items with same ID and Year and then try to get the product names.

transprod = pd.merge(trans,product,on='productID' , how='inner')
transprod


transprod['Rank'] = transprod.groupby('ID')['Year'].rank(method = 'dense').astype(int)
transprod = transprod.sort_values(['ID','productID','Rank'])
transprod

Desired Output:

Product 1 | Product 2 | Count
phone       charger      3
Charger      pin         1
eaphones     pin         1
pin         cover        1

Any help is really appreciated. Thanks in advance


You could group the transactions table by ID (and date) and list all product pairs for each order. itertools.combinations is useful here. By taking the set over an order first, you can ignore multiple equal items.

Since it does not matter in which order a pair appears, you could then construct a flat list of all the pairs and use a collections.Counter instance to count them. Sorting each pair first makes sure that you can disregard the order of items within a pair.

The product table can be transformed into a dictionary for easy lookup. This will provide a way to add the product names to the table of results.

from itertools import combinations
from collections import Counter

pairs_by_trans = trans.groupby(['ID', 'Year'])['productID'].agg(
                       lambda x: list(combinations(set(x), 2)))
pairs_flat = [tuple(sorted(pair)) for row in pairs_by_trans for pair in row]

counts = Counter(pairs_flat)
top_counts = pd.DataFrame(counts.most_common(5),
                          columns=['pair', 'count'])

prodname = {k: v for k, v in product.values}
top_counts['names'] = top_counts['pair'].apply(lambda x: (prodname[x[0]], 
                                                          prodname[x[1]]))

top_counts
    pair    count   names
0   (11, 22)    3   (phone, Charger)
1   (33, 77)    1   (eaphones, pin)
2   (77, 88)    1   (pin, cover)
3   (11, 77)    1   (phone, pin)
4   (22, 77)    1   (Charger, pin)