Efficient way to apply multiple filters to pandas DataFrame or Series

Question

I have a scenario where a user wants to apply several filters to a Pandas DataFrame or Series object. Essentially, I want to efficiently chain a bunch of filtering (comparison operations) together that are specified at run-time by the user.

The filters should be additive (aka each one applied should narrow results).

I'm currently using reindex() but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). So, this could be really inefficient when filtering a big Series or DataFrame.

I'm thinking that using apply(), map(), or something similar might be better. I'm pretty new to Pandas though so still trying to wrap my head around everything.

TL;DR

I want to take a dictionary of the following form and apply each operation to a given Series object and return a 'filtered' Series object.

relops = {'>=': [1], '<=': [1]}

Long Example

I'll start with an example of what I have currently and just filtering a single Series object. Below is the function I'm currently using:

    def apply_relops(series, relops):
        """
        Pass dictionary of relational operators to perform on given series object
        """
        for op, vals in relops.items():
            op_func = ops[op]
            for val in vals:
                filtered = op_func(series, val)
                # reindex() expects index labels; passing the filtered values
                # works here only because they coincide with the labels
                series = series.reindex(series[filtered])
        return series

The user provides a dictionary with the operations they want to perform:

>>> import pandas
>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print(df)
   col1  col2
0     0    10
1     1    11
2     2    12

>>> from operator import le, ge
>>> ops = {'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1       1
2       2
Name: col1
>>> apply_relops(df['col1'], relops = {'>=': [1], '<=': [1]})
col1
1       1
Name: col1

Again, the 'problem' with my above approach is that I think there is a lot of possibly unnecessary copying of the data for the in-between steps.

Also, I would like to expand this so that the dictionary passed in can include the columns to operate on, and filter an entire DataFrame based on the input dictionary. However, I'm assuming whatever works for the Series can be easily expanded to a DataFrame.

Solution

Pandas (and numpy) allow for boolean indexing, which will be much more efficient:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]: 
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]: 
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <= 1)]
Out[13]: 
   col1  col2
1     1    11

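If you want to write helper functions for this, consider something along these lines (with numpy imported as np, and ge and le imported from operator as in the question):

In [14]: def b(x, col, op, n):
             return op(x[col], n)

In [15]: def f(x, *masks):
             # reduce() ANDs any number of masks; np.logical_and(*masks)
             # only handles exactly two
             return x[np.logical_and.reduce(masks)]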

In [16]: b1 = b(df, 'col1', ge, 1)

In [17]: b2 = b(df, 'col1', le, 1)

In [18]: f(df, b1, b2)
Out[18]: 
   col1  col2
1     1    11
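To cover the DataFrame case from the question, where the user also names the column for each filter, the helpers could be driven by a list of filters. The following is a minimal sketch under assumptions: the (column, operator, value) triple format, the OPS table, and the name filter_frame are all hypothetical and chosen only for illustration.

```python
import operator

import numpy as np
import pandas as pd

# Hypothetical lookup table from operator strings to comparison functions.
OPS = {">=": operator.ge, "<=": operator.le}


def filter_frame(frame, filters):
    """Keep the rows of `frame` that satisfy every (column, op, value) triple."""
    if not filters:
        # Nothing to apply, so return the frame unchanged.
        return frame
    # Build one boolean mask per filter.
    masks = [OPS[op](frame[col], val) for col, op, val in filters]
    # reduce() ANDs all masks together and returns a boolean ndarray.
    return frame[np.logical_and.reduce(masks)]


df = pd.DataFrame({"col1": [0, 1, 2], "col2": [10, 11, 12]})
filtered = filter_frame(df, [("col1", ">=", 1), ("col2", "<=", 11)])
print(filtered)
```

A list of triples is used here instead of a dictionary so that the same column can appear in several filters; a nested dictionary keyed by column would work just as well.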

Update: pandas 0.13 added a query method for these kinds of use cases. Assuming the column names are valid identifiers, the following works (and can be more efficient for large frames, as it uses numexpr behind the scenes):

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
   col1  col2
1     1    11
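Since the filters are specified at run time, the query string could also be assembled from them. The sketch below is an assumption about how that might be done, not part of the original answer: build_query and the triple format are hypothetical, and the column names are inserted verbatim, so they need to be valid identifiers from a trusted source.

```python
import pandas as pd

# Operators this sketch is willing to place into a query string.
ALLOWED_OPS = {">=", "<=", ">", "<", "==", "!="}


def build_query(filters):
    """Join (column, op, value) triples into a single query expression."""
    parts = []
    for col, op, val in filters:
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        # repr() keeps string values quoted inside the expression.
        parts.append(f"({col} {op} {val!r})")
    return " and ".join(parts)


df = pd.DataFrame({"col1": [0, 1, 2], "col2": [10, 11, 12]})
expr = build_query([("col1", ">=", 1), ("col1", "<=", 1)])
result = df.query(expr)
print(expr)
print(result)
```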
