Python Pandas count most frequent occurrences


Question

This is my sample data frame with data about orders:

import pandas as pd
my_dict = { 
     'status' : ["a", "b", "c", "d", "a","a", "d"],
     'city' : ["London","Berlin","Paris", "Berlin", "Boston", "Paris", "Boston"],
     'components': ["a01, a02, b01, b07, b08, с03, d07, e05, e06", 
                    "a01, b02, b35, b68, с43, d02, d07, e04, e05, e08", 
                    "a02, a05, b08, с03, d02, d06, e04, e05, e06", 
                    "a03, a26, a28, a53, b08, с03, d02, f01, f24", 
                    "a01, a28, a46, b37, с43, d06, e04, e05, f02", 
                    "a02, a05, b35, b68, с43, d02, d07, e04, e05, e08", 
                    "a02, a03, b08, b68, с43, d06, d07, e04, e05, e08"]
}
df = pd.DataFrame(my_dict)
df

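I need to count the most frequent:

  1. top-n co-occurring components within an order
  2. top-n most common components (whether or not they co-occur)

What is the best way to do that?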

I can also see the relation to the market basket analysis problem, but I am not sure how to approach it.

Recommended answer

@ScottBoston's answer shows vectorized (hence probably faster) ways to achieve this. The approach here uses plain Python from the standard library (collections.Counter and itertools).

Top-n most common components

from collections import Counter
from itertools import chain

n = 3
individual_components = chain.from_iterable(df['components'].str.split(', '))
counter = Counter(individual_components)
print(counter.most_common(n))
# [('e05', 6), ('e04', 5), ('a02', 4)]
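
For comparison, the same count can probably be expressed with pandas alone. The sketch below is not taken from the answer referenced above; it assumes pandas 0.25 or later (for Series.explode) and uses a small made-up `df` so that it runs on its own. Each order string is split into a list, the lists are exploded into one row per component, and value_counts tallies them.

```python
import pandas as pd

# Hypothetical miniature data set, only for illustration
df = pd.DataFrame({
    'components': [
        "a01, a02, b01",
        "a01, b02",
        "a02, a01",
    ]
})

n = 2
top_components = (
    df['components']
    .str.split(', ')   # one list of components per order
    .explode()         # one row per individual component
    .value_counts()    # occurrences, sorted in descending order
    .head(n)           # keep the n most frequent
)
print(top_components)
```

How ties are ordered may differ from Counter.most_common, so the two approaches can rank equally frequent components differently.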


Top-n co-occurrences

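Note that n is used twice in the code: once as the size of each co-occurrence (the length of the combinations) and once for the "top-n" cutoff. You can obviously use two different variables. Each order's components are sorted before combining so that the same set of components always produces the same tuple.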

from collections import Counter
from itertools import combinations

n = 3
individual_components = []
for components in df['components']:
    order_components = sorted(components.split(', '))
    individual_components.extend(combinations(order_components, n))
counter = Counter(individual_components)
print(counter.most_common(n))
# [(('e04', 'e05', 'с43'), 4), (('a02', 'b08', 'e05'), 3), (('a02', 'd07', 'e05'), 3)]
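
A minimal sketch of the two-variable version follows. The function name `top_cooccurrences` and its parameters `size` and `top` are hypothetical names chosen for illustration, and the sample `orders` list is made up so that the block runs on its own; the logic is otherwise the same as in the code above.

```python
from collections import Counter
from itertools import combinations

def top_cooccurrences(orders, size, top):
    """Return the `top` most frequent component groups of length `size`.

    `orders` is an iterable of strings such as "a01, a02, b01".
    """
    counter = Counter()
    for components in orders:
        # Sort so that the same group always yields the same tuple
        order_components = sorted(components.split(', '))
        counter.update(combinations(order_components, size))
    return counter.most_common(top)

# Hypothetical sample data, only for illustration
orders = [
    "a01, a02, b01",
    "a01, a02, b02",
    "a02, a01, b01",
    "b02, c01",
]
print(top_cooccurrences(orders, size=2, top=1))
# [(('a01', 'a02'), 3)]
```

With the data frame from the question, the call would presumably be top_cooccurrences(df['components'], size=3, top=3). Keep in mind that the number of combinations per order grows quickly with size, so large values can become slow.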
