Statistics based on number of matches in dataframe column


问题描述

I'm looking for a Pythonic approach to capture stats based on the number of matches in a DF column. So, working with this example:

import pandas as pd

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A',
                                     'A', 'A', 'A', 'B', 'B', 'B', 'A']},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))
test_B_mask = rng['initial_data'] == 'B'
rng['test_for_B'] = rng['initial_data'][test_B_mask]

and running this function to provide matches:

def func_match(df_in, val):
    return ((df_in == val) & (df_in.shift() == val)).astype(int)

func_match(rng['test_for_B'], rng['test_for_B'])

I get the following output:

2014-04-02 09:00:00    0
2014-04-02 10:00:00    0
2014-04-02 11:00:00    0
2014-04-02 12:00:00    0
2014-04-02 13:00:00    0
2014-04-02 14:00:00    1
2014-04-02 15:00:00    0
2014-04-02 16:00:00    0
2014-04-03 09:00:00    0
2014-04-03 10:00:00    0
2014-04-03 11:00:00    0
2014-04-03 12:00:00    1
2014-04-03 13:00:00    1
2014-04-03 14:00:00    0
Freq: BH, Name: test_for_B, dtype: int64

I can use something simple like func_match(rng['test_for_B'], rng['test_for_B']).sum(), which returns

3

to get the total number of times the values match, but could someone help with a function that provides the following, more granular, statistics?


  • Amount and percentage of times a single match is seen.
  • Amount and percentage of times two consecutive matches are seen (up to n max matches, which is just 3 matches, 2014-04-02 11:00:00 through 13:00:00, in this example).

I'm guessing this would be a dict used within the function, but I'm sure many of the experienced coders on Stack Overflow are used to conducting this kind of analysis, so I would love to learn how to approach this task.

Thank you in advance for any help with this.

EDIT:

I didn't initially specify the desired output as I am open to all options and didn't want to deter anyone from providing solutions. However, as per the request from MaxU for desired output, something like this would be great:

          Matches  Matches_Percent
0 match         3               30
1 match         4               40
2 match         2               20
3 match         1               10
etc
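For reference, a run-length tally of roughly that shape can be sketched with itertools.groupby over the column (the 'Matches'/'Matches_Percent' column names are taken from the example above; here the percentage is each run length's share of all runs of 'B'):

```python
from itertools import groupby

import pandas as pd

rng = pd.DataFrame(
    {'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A',
                      'A', 'A', 'A', 'B', 'B', 'B', 'A']},
    index=pd.date_range('4/2/2014', periods=14, freq='BH'))

# Length of each consecutive run of 'B' in the column.
run_lengths = [sum(1 for _ in grp)
               for val, grp in groupby(rng['initial_data']) if val == 'B']

# Tally how often each run length occurs, plus its share of all runs.
tally = pd.Series(run_lengths).value_counts().sort_index()
summary = pd.DataFrame({'Matches': tally,
                        'Matches_Percent': 100 * tally / tally.sum()})
summary.index = ['%d match' % n for n in summary.index]
print(summary)
```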


Recommended answer

Initial setup

rng = pd.DataFrame({'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A',]},
                   index = pd.date_range('4/2/2014', periods=14, freq='BH'))



Assign bool to column 'test_for_B'

rng['test_for_B'] = rng['initial_data'] == 'B'



Tricky bit

Test for 'B' where the previous row was not 'B'. This signifies the beginning of a group. Then cumsum ties the groups together.

contigious_groups = ((rng.initial_data == 'B') & (rng.initial_data != rng.initial_data.shift())).cumsum()
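To make the grouping concrete, here is what it evaluates to on the sample data (a self-contained sketch reusing the answer's variable names):

```python
import pandas as pd

rng = pd.DataFrame(
    {'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A',
                      'A', 'A', 'A', 'B', 'B', 'B', 'A']},
    index=pd.date_range('4/2/2014', periods=14, freq='BH'))

# True exactly where a run of 'B' starts; cumsum turns those starts
# into group labels (0 before any 'B', then 1, 2, ... per run).
starts = (rng.initial_data == 'B') & (rng.initial_data != rng.initial_data.shift())
contigious_groups = starts.cumsum()
print(contigious_groups.tolist())
# [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
```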

Now I groupby this grouping we created and sum the bools within each group. This gets at whether it's a double, triple, etc.

counts = rng.loc[contigious_groups.astype(bool)].groupby(contigious_groups).test_for_B.sum()
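On the sample data this yields one B-count per group: group 1 holds the double, group 2 the triple. A runnable sketch:

```python
import pandas as pd

rng = pd.DataFrame(
    {'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A',
                      'A', 'A', 'A', 'B', 'B', 'B', 'A']},
    index=pd.date_range('4/2/2014', periods=14, freq='BH'))
rng['test_for_B'] = rng['initial_data'] == 'B'

contigious_groups = ((rng.initial_data == 'B') &
                     (rng.initial_data != rng.initial_data.shift())).cumsum()

# Drop the label-0 rows (before the first 'B'), then count the Trues per group.
counts = rng.loc[contigious_groups.astype(bool)].groupby(contigious_groups).test_for_B.sum()
print(counts.tolist())  # [2, 3]: a run of two B's and a run of three
```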

Then use value_counts to get the frequency of each group type, and divide by contigious_groups.max() because that's a count of how many groups there are.

counts.value_counts() / contigious_groups.max()

3.0    0.5
2.0    0.5
Name: test_for_B, dtype: float64
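If the tabular output requested in the edit is wanted, the same counts can be reshaped into it (a sketch; percentages are per run, and run lengths that never occur simply don't appear as rows):

```python
import pandas as pd

rng = pd.DataFrame(
    {'initial_data': ['A', 'A', 'A', 'A', 'B', 'B', 'A',
                      'A', 'A', 'A', 'B', 'B', 'B', 'A']},
    index=pd.date_range('4/2/2014', periods=14, freq='BH'))
rng['test_for_B'] = rng['initial_data'] == 'B'

contigious_groups = ((rng.initial_data == 'B') &
                     (rng.initial_data != rng.initial_data.shift())).cumsum()
counts = rng.loc[contigious_groups.astype(bool)].groupby(contigious_groups).test_for_B.sum()

# Frequency of each run length, and that frequency as a percentage of all runs.
freq = counts.value_counts().sort_index()
result = pd.DataFrame({'Matches': freq,
                       'Matches_Percent': 100 * freq / freq.sum()})
result.index = ['%d match' % n for n in result.index]
print(result)
```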
