Conditional filtering of a pandas data frame

Question

I have a pandas data frame of football results. Each row of the dataframe represents one football match. The information for each match is:

Day | WinningTeamID | LosingTeamID | WinningPoints | LosingPoints | WinningFouls | ...
1   | 13            | 1            | 45            | 5            | 3            | ...
1   | 12            | 4            | 21            | 12           | 4            | ...

That is, the information is split by game result: winning or losing. I would like to retrieve the data for each game played by a specific team (e.g. 12).

Day | Points | Fouls | ...
1   | 21     | 4     | ...
2   | 32     | 6     | ...

The simplest way is to scan the whole dataframe, check whether a specific team ID appears in WinningTeamID or LosingTeamID and then, based on that, retrieve the "Losing" columns or the "Winning" columns. Is there a more "elegant" way of slicing the pandas dataframe? The following simply gives me the subset of matches in which team 12 is involved.

df[(df['WinningTeamID'] == 12) | (df['LosingTeamID'] == 12)]

How can I filter this data and create the desired dataframe?

Recommended answer

Suppose we could choose the format of the data. What would be ideal? Since we want to collect stats per TeamID, ideally we would have a single TeamID column and a separate column for each stat, including the outcome.

So the data would look like this:

| Day | Outcome | TeamID | Points | Fouls |
|   1 | Winning |     13 |     45 |     3 |
|   1 | Losing  |      1 |      5 |   NaN |
|   1 | Winning |     12 |     21 |     4 |
|   1 | Losing  |      4 |     12 |   NaN |

Here is how we can manipulate the given data into the desired form:

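import pandas as pd

df = pd.DataFrame({'Day': [1, 1], 'LosingPoints': [5, 12], 'LosingTeamID': [1, 4], 'WinningFouls': [3, 4], 'WinningPoints': [45, 21], 'WinningTeamID': [13, 12]})
# Move Day into the index so the reshaping below leaves it alone.
df = df.set_index(['Day'])
# Split each label into an optional Losing/Winning prefix and the remainder.
columns = df.columns.to_series().str.extract(r'^(Losing|Winning)?(.*)', expand=True)
# Rebuild the columns as a two-level MultiIndex: (Outcome, stat name).
columns = pd.MultiIndex.from_arrays([columns[col] for col in columns], 
                                    names=['Outcome', None])
df.columns = columns
# Move the Outcome level from the columns into the row index, then flatten.
df = df.stack(level='Outcome').reset_index()
print(df)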

yields

   Day  Outcome  Fouls  Points  TeamID
0    1   Losing    NaN       5       1
1    1  Winning    3.0      45      13
2    1   Losing    NaN      12       4
3    1  Winning    4.0      21      12

Now we can retrieve all the stats for team 12 using

print(df.loc[df['TeamID']==12])
#    Day  Outcome  Fouls  Points  TeamID
# 3    1  Winning    4.0      21      12
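
As a minimal sketch of how that selection might be wrapped for reuse, the snippet below types in the reshaped frame by hand and filters it through a small helper. The names long_df and stats_for_team are hypothetical and do not appear in the answer above.

```python
import pandas as pd

# Long-form frame, entered by hand to match the printed result above.
long_df = pd.DataFrame({
    'Day': [1, 1, 1, 1],
    'Outcome': ['Losing', 'Winning', 'Losing', 'Winning'],
    'Fouls': [float('nan'), 3.0, float('nan'), 4.0],
    'Points': [5, 45, 12, 21],
    'TeamID': [1, 13, 4, 12],
})


def stats_for_team(frame, team_id):
    """Return the rows for one team, dropping the now-constant TeamID column."""
    mask = frame['TeamID'] == team_id
    return frame.loc[mask].drop(columns='TeamID').reset_index(drop=True)


team12 = stats_for_team(long_df, 12)
print(team12)
```

Dropping TeamID afterwards gives a frame shaped like the one the question asks for (Day, Points, Fouls, ...), with Outcome kept as an extra column.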


df = df.set_index(['Day']) moves the Day column into the index.

The purpose of placing Day in the index is to "protect" it from manipulations (primarily the stack call) that are intended only for columns labeled Losing or Winning. If you had other columns, such as Location or Officials, which, like Day, do not pertain to Losing or Winning, then you'd want to include them in the set_index call too: e.g. df = df.set_index(['Day', 'Location', 'Officials']).
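
To illustrate, here is a small sketch with one such extra column. The Location column and its values are made up for the example; the rest follows the same steps as the code above.

```python
import pandas as pd

wide = pd.DataFrame({
    'Day': [1, 1],
    'Location': ['Home', 'Away'],   # hypothetical extra column, not in the question
    'WinningTeamID': [13, 12],
    'LosingTeamID': [1, 4],
    'WinningPoints': [45, 21],
    'LosingPoints': [5, 12],
})

# Every column that is not a Winning*/Losing* column goes into the index first.
wide = wide.set_index(['Day', 'Location'])

# Same relabelling as before: (Outcome, stat name) pairs as a MultiIndex.
parts = wide.columns.to_series().str.extract(r'^(Losing|Winning)?(.*)', expand=True)
wide.columns = pd.MultiIndex.from_arrays([parts[0], parts[1]],
                                         names=['Outcome', None])

# Day and Location ride along in the row index and come back as columns.
long_df = wide.stack(level='Outcome').reset_index()
print(long_df)
```

Each match still produces exactly two rows, and both rows carry the Day and Location of that match.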

Try commenting out df = df.set_index(['Day']) from the code above. Then step through the code line by line. In particular, compare what df.stack(level='Outcome') looks like with and without the set_index call:

With df = df.set_index(['Day']):

In [26]: df.stack(level='Outcome')
Out[26]: 
             Fouls  Points  TeamID
Day Outcome                       
1   Losing     NaN       5       1
    Winning    3.0      45      13
    Losing     NaN      12       4
    Winning    4.0      21      12

Without df = df.set_index(['Day']):

In [29]: df.stack(level='Outcome')
Out[29]: 
           Day  Fouls  Points  TeamID
  Outcome                            
0 NaN      1.0    3.0      45      13
  Losing   NaN    NaN       5       1
  Winning  1.0    3.0      45      13
1 NaN      1.0    4.0      21      12
  Losing   NaN    NaN      12       4
  Winning  1.0    4.0      21      12

Without the set_index call you end up with rows that you do not want: the rows where Outcome equals NaN.

The purpose of

columns = df.columns.to_series().str.extract(r'^(Losing|Winning)?(.*)', expand=True)
columns = pd.MultiIndex.from_arrays([columns[col] for col in columns], 
                                    names=['Outcome', None])

is to create a multi-level column index (called a MultiIndex) that labels each column Losing or Winning as appropriate. Notice that by separating out the Losing or Winning part of each label, the remaining parts of the labels become duplicated.

We end up with a DataFrame, df, that has two columns labeled "Points", for example. This allows pandas to identify these columns as somehow similar.
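
A short sketch of what the regular expression does to the labels may help here. It runs the same pattern on the column names alone; the variable names labels, parts and multi are chosen for this example only.

```python
import pandas as pd

labels = pd.Index(['Day', 'LosingPoints', 'LosingTeamID', 'WinningFouls',
                   'WinningPoints', 'WinningTeamID'])

# Group 1 captures the optional Losing/Winning prefix, group 2 the remainder.
# A label with no prefix, such as 'Day', gets a missing value in group 1.
parts = labels.to_series().str.extract(r'^(Losing|Winning)?(.*)', expand=True)
print(parts)

# With Day set aside (as set_index does), the two groups form the MultiIndex.
prefixed = parts.drop(index='Day')
multi = pd.MultiIndex.from_arrays([prefixed[0], prefixed[1]],
                                  names=['Outcome', None])
print(multi.tolist())
```

The missing prefix for 'Day' is where the unwanted NaN Outcome rows come from when Day is left among the columns.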

The big gain, and the reason why we went to the trouble of setting up the MultiIndex, is that these "similar" columns can be "unified" by calling df.stack:

In [47]: df
Out[47]: 
Outcome Losing        Winning              
        Points TeamID   Fouls Points TeamID
Day                                        
1            5      1       3     45     13
1           12      4       4     21     12

In [48]: df.stack(level="Outcome")
Out[48]: 
             Fouls  Points  TeamID
Day Outcome                       
1   Losing     NaN       5       1
    Winning    3.0      45      13
    Losing     NaN      12       4
    Winning    4.0      21      12


stack, unstack, set_index and reset_index are the 4 fundamental DataFrame reshaping operations.

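  • df.stack moves one or more levels of the column index into the row index.
  • df.unstack moves one or more levels of the row index into the column index.
  • df.set_index moves column values into the row index.
  • df.reset_index moves one or more levels of the row index into value columns.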

Together, these 4 methods allow you to move the data in your DataFrame anywhere you want: into the columns, the row index or the column index.
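
The four operations can be tried on a tiny frame. This is only a sketch with made-up variable names, using the Day/Points/Fouls values from the question's desired output.

```python
import pandas as pd

df = pd.DataFrame({'Day': [1, 2], 'Points': [21, 32], 'Fouls': [4, 6]})

# set_index: the Day column values become the row index.
indexed = df.set_index('Day')

# stack: the column labels become the innermost row-index level.
# With single-level columns the result is a Series.
stacked = indexed.stack()

# unstack: the innermost row-index level goes back to being column labels.
unstacked = stacked.unstack()

# reset_index: the Day index level becomes an ordinary column again.
restored = unstacked.reset_index()

print(stacked)
print(restored)
```

Here stack and unstack undo each other, as do set_index and reset_index, so restored holds the same values as the starting frame.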

The code above is an example of how to use these tools (well, three of the four) to reshape data into a desired form.
