pandas -胜率计算;按两列分组并计数 [英] Pandas - win ratio calculation; group by two columns and count

查看:126
本文介绍了 pandas -胜率计算;按两列分组并计数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含以下列的数据框:

| winner |  loser  | tournament |
+--------+---------+------------+
| John   | Steve   |      A     |
+--------+---------+------------+
| Steve  | John    |      B     |
+--------+---------+------------+
| John   | Michael |      A     |
+--------+---------+------------+
| Steve  | John    |      A     |
+--------+---------+------------+

我想要做的是针对给定的锦标赛类型计算获胜者和输者的历史获胜百分比,并将其放在自己的栏中.


下面是填写上表的示例.游戏将被称为(获胜者,失败者,类型).

我也添加了中间计算列,以使其更清晰.


1)对于第一场比赛(约翰,史蒂夫,A).以前没有A型游戏.因此我们填充0.

2)第二局(Steve,John,B).以前没有类型B的游戏.因此我们用0填充.

3)第三局(John,Michael,A).以前有Type A游戏,所以我们可以获得信息.首先,约翰是赢家.他在表的第一行赢得了A类的一场比赛.因此,我们将获胜者获胜=1.约翰之前没有输过A类游戏,因此我们将获胜者损失=0.迈克尔没有任何比赛历史,因此我们填充了失败者获胜= 0和失败者损失= 0. >

4)对于第四局,(史蒂夫,约翰,A).我们看到史蒂夫以前从未赢得过任何A类游戏,因此我们将获胜者获胜次数=0.他输了1次A类游戏(第一行).因此,我们将获胜者损失=1.约翰赢得了2场A类游戏,因此失败者获胜=2.他输了

    +--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| winner |  loser  | tournament | winner wins | loser wins | winner losses | loser losses | winner win % | loser win % |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  John  |  Steve  |      A     |      0      |      0     |       0       |       0      | 0/(0+0)      | 0/(0+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  Steve |   John  |      B     |      0      |      0     |       0       |       0      | 0/(0+0)      | 0/(0+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  John  | Michael |      A     |      1      |      0     |       0       |       0      | 1/(1+0)      | 0/(0+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
|  Steve |   John  |      A     |      0      |      2     |       1       |       0      | 0/(0+1)      | 2/(2+0)     |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------

解决方案

这应该产生预期的结果:

def win_los_percent(sdf):
    sdf['winner wins'] = sdf.groupby('winner').cumcount()
    sdf['winner losses'] = [(sdf.loc[0:i, 'loser'] == sdf.loc[i, 'winner']).sum() for i in sdf.index]
    sdf['loser losses'] = sdf.groupby('loser').cumcount()
    sdf['loser wins'] = [(sdf.loc[0:i, 'winner'] == sdf.loc[i, 'loser']).sum() for i in sdf.index]
    sdf['winner win %'] = sdf['winner wins'] / (sdf['winner wins'] + sdf['winner losses'])
    sdf['loser win %'] = sdf['loser wins'] / (sdf['loser wins'] + sdf['loser losses'])
    return sdf

ddf = df.groupby('tournament').apply(win_los_percent)

使用提供的数据,ddf为:

  winner    loser tournament  winner wins  winner losses  loser losses  loser wins  winner win %  loser win %
0   John    Steve          A            0              0             0           0           NaN          NaN
1  Steve     John          B            0              0             0           0           NaN          NaN
2   John  Michael          A            1              0             0           0           1.0          NaN
3  Steve     John          A            0              1             0           2           0.0          1.0

熊猫groupby 是用于对同一比赛的数据进行分组,并将子数据帧传递给win_los_percent函数.返回此函数的返回值以构建最终数据帧.

对于每个子集,该函数计算几列:

  • sdf['winner wins']sdf['loser losses']是通过使用 解决方案

This shoud produce the expected result:

def win_los_percent(sdf):
    sdf['winner wins'] = sdf.groupby('winner').cumcount()
    sdf['winner losses'] = [(sdf.loc[0:i, 'loser'] == sdf.loc[i, 'winner']).sum() for i in sdf.index]
    sdf['loser losses'] = sdf.groupby('loser').cumcount()
    sdf['loser wins'] = [(sdf.loc[0:i, 'winner'] == sdf.loc[i, 'loser']).sum() for i in sdf.index]
    sdf['winner win %'] = sdf['winner wins'] / (sdf['winner wins'] + sdf['winner losses'])
    sdf['loser win %'] = sdf['loser wins'] / (sdf['loser wins'] + sdf['loser losses'])
    return sdf

ddf = df.groupby('tournament').apply(win_los_percent)

Using the data provided, ddf is:

  winner    loser tournament  winner wins  winner losses  loser losses  loser wins  winner win %  loser win %
0   John    Steve          A            0              0             0           0           NaN          NaN
1  Steve     John          B            0              0             0           0           NaN          NaN
2   John  Michael          A            1              0             0           0           1.0          NaN
3  Steve     John          A            0              1             0           2           0.0          1.0

pandas groupby is used to group the data of the same tournament, and pass the subdataframe to the win_los_percent function. The return value of this function is returned to build the final dataframe.

For each subset, the function calculates the several columns:

  • sdf['winner wins'] and sdf['loser losses'] are obtained by using cumcount: for each row, this method counts the previous occurrences of the value (the player name) in the grouping column.
  • obtaining sdf['winner losses'] and sdf['loser wins'] is a bit more elaborated, since we need to count the previous occurrence of a value (player name) in another column.
    The list comprehension iterates over the dataframe index to select the previous rows and check if the player name in the column 'winner' is equal to the playeer name at current row in column loser or viceversa. sum() allows to count the True values: True is casted to 1, False to 0, the sum gives the wanted result: how many times the player name is present in previous rows.
  • Percentage columns are obtained with vectorization. Where the result is NaN is because there is a division by 0.

这篇关于 pandas -胜率计算;按两列分组并计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆