pandas -胜率计算;按两列分组并计数 [英] Pandas - win ratio calculation; group by two columns and count
问题描述
我有一个包含以下列的数据框:
| winner | loser | tournament |
+--------+---------+------------+
| John | Steve | A |
+--------+---------+------------+
| Steve | John | B |
+--------+---------+------------+
| John | Michael | A |
+--------+---------+------------+
| Steve | John | A |
+--------+---------+------------+
我想要做的是针对给定的锦标赛类型计算获胜者和输者的历史获胜百分比,并将其放在自己的栏中.
下面是填写上表的示例.游戏将被称为(获胜者,失败者,类型).
我也添加了中间计算列,以使其更清晰.
1)对于第一场比赛(约翰,史蒂夫,A).以前没有A型游戏.因此我们填充0.
2)第二局(Steve,John,B).以前没有类型B的游戏.因此我们用0填充.
3)第三局(John,Michael,A).以前有Type A游戏,所以我们可以获得信息.首先,约翰是赢家.他在表的第一行赢得了A类的一场比赛.因此,我们将获胜者获胜=1.约翰之前没有输过A类游戏,因此我们将获胜者损失=0.迈克尔没有任何比赛历史,因此我们填充了失败者获胜= 0和失败者损失= 0. >
4)对于第四局,(史蒂夫,约翰,A).我们看到史蒂夫以前从未赢得过任何A类游戏,因此我们将获胜者获胜次数=0.他输了1次A类游戏(第一行).因此,我们将获胜者损失=1.约翰赢得了2场A类游戏,因此失败者获胜=2.他输了
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| winner | loser | tournament | winner wins | loser wins | winner losses | loser losses | winner win % | loser win % |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| John | Steve | A | 0 | 0 | 0 | 0 | 0/(0+0) | 0/(0+0) |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| Steve | John | B | 0 | 0 | 0 | 0 | 0/(0+0) | 0/(0+0) |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| John | Michael | A | 1 | 0 | 0 | 0 | 1/(1+0) | 0/(0+0) |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------+
| Steve | John | A | 0 | 2 | 1 | 0 | 0/(0+1) | 2/(2+0) |
+--------+---------+------------+-------------+------------+---------------+--------------+--------------+-------------
这应该产生预期的结果:
def win_los_percent(sdf):
sdf['winner wins'] = sdf.groupby('winner').cumcount()
sdf['winner losses'] = [(sdf.loc[0:i, 'loser'] == sdf.loc[i, 'winner']).sum() for i in sdf.index]
sdf['loser losses'] = sdf.groupby('loser').cumcount()
sdf['loser wins'] = [(sdf.loc[0:i, 'winner'] == sdf.loc[i, 'loser']).sum() for i in sdf.index]
sdf['winner win %'] = sdf['winner wins'] / (sdf['winner wins'] + sdf['winner losses'])
sdf['loser win %'] = sdf['loser wins'] / (sdf['loser wins'] + sdf['loser losses'])
return sdf
ddf = df.groupby('tournament').apply(win_los_percent)
使用提供的数据,ddf
为:
winner loser tournament winner wins winner losses loser losses loser wins winner win % loser win %
0 John Steve A 0 0 0 0 NaN NaN
1 Steve John B 0 0 0 0 NaN NaN
2 John Michael A 1 0 0 0 1.0 NaN
3 Steve John A 0 1 0 2 0.0 1.0
熊猫groupby 是用于对同一比赛的数据进行分组,并将子数据帧传递给win_los_percent
函数.返回此函数的返回值以构建最终数据帧.
对于每个子集,该函数计算几列:
-
sdf['winner wins']
和sdf['loser losses']
是通过使用 解决方案
This shoud produce the expected result:
def win_los_percent(sdf):
sdf['winner wins'] = sdf.groupby('winner').cumcount()
sdf['winner losses'] = [(sdf.loc[0:i, 'loser'] == sdf.loc[i, 'winner']).sum() for i in sdf.index]
sdf['loser losses'] = sdf.groupby('loser').cumcount()
sdf['loser wins'] = [(sdf.loc[0:i, 'winner'] == sdf.loc[i, 'loser']).sum() for i in sdf.index]
sdf['winner win %'] = sdf['winner wins'] / (sdf['winner wins'] + sdf['winner losses'])
sdf['loser win %'] = sdf['loser wins'] / (sdf['loser wins'] + sdf['loser losses'])
return sdf
ddf = df.groupby('tournament').apply(win_los_percent)
Using the data provided, ddf
is:
winner loser tournament winner wins winner losses loser losses loser wins winner win % loser win %
0 John Steve A 0 0 0 0 NaN NaN
1 Steve John B 0 0 0 0 NaN NaN
2 John Michael A 1 0 0 0 1.0 NaN
3 Steve John A 0 1 0 2 0.0 1.0
pandas groupby is used to group the data of the same tournament, and pass the subdataframe to the win_los_percent
function. The return value of this function is returned to build the final dataframe.
For each subset, the function calculates the several columns:
sdf['winner wins']
andsdf['loser losses']
are obtained by using cumcount: for each row, this method counts the previous occurrences of the value (the player name) in the grouping column.- obtaining
sdf['winner losses']
andsdf['loser wins']
is a bit more elaborated, since we need to count the previous occurrence of a value (player name) in another column.
The list comprehension iterates over the dataframe index to select the previous rows and check if the player name in the column'winner'
is equal to the playeer name at current row in columnloser
or viceversa.sum()
allows to count the True values: True is casted to 1, False to 0, the sum gives the wanted result: how many times the player name is present in previous rows. - Percentage columns are obtained with vectorization. Where the result is
NaN
is because there is a division by 0.
这篇关于 pandas -胜率计算;按两列分组并计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!