Selecting from pandas dataframe (or numpy ndarray?) by criterion


Question

I find myself coding this sort of pattern a lot:

tmp = <some operation>
result = tmp[<boolean expression>]
del tmp

...where <boolean expression> is to be understood as a boolean expression involving tmp. (For the time being, tmp is always a pandas dataframe, but I suppose that the same pattern would show up if I were working with numpy ndarrays--not sure.)

For example:

tmp = df.xs('A')['II'] - df.xs('B')['II']
result = tmp[tmp < 0]
del tmp

As one can guess from the del tmp at the end, the only reason for creating tmp at all is so that I can use a boolean expression involving it inside an indexing expression applied to it.

I would love to eliminate the need for this (otherwise useless) intermediate, but I don't know of any efficient [1] way to do this. (Please correct me if I'm wrong!)

As second best, I'd like to push off this pattern to some helper function. The problem is finding a decent way to pass the <boolean expression> to it. I can only think of indecent ones. E.g.:

def filterobj(obj, criterion):
    return obj[eval(criterion % 'obj')]

This actually works [2]:

filterobj(df.xs('A')['II'] - df.xs('B')['II'], '%s < 0')

# Int
# 0     -1.650107
# 2     -0.718555
# 3     -1.725498
# 4     -0.306617
# Name: II

...but using eval always leaves me feeling all yukky 'n' stuff... Please let me know if there's some other way.

[1] E.g., any approach I can think of involving the filter built-in is probably inefficient, since it would apply the criterion (some lambda function) by iterating, "in Python", over the pandas (or numpy) object...

[2] The definition of df used in the last expression above would be something like this:

import itertools
import pandas as pd
import numpy as np
a = ('A', 'B')
i = range(5)
ix = pd.MultiIndex.from_tuples(list(itertools.product(a, i)),
                               names=('Alpha', 'Int'))
c = ('I', 'II', 'III')
df = pd.DataFrame(np.random.randn(len(ix), len(c)), index=ix, columns=c)

Recommended answer

Because of the way Python works, I think this one's going to be tough. I can only think of hacks which only get you part of the way there. Something like

def filterobj(obj, fn):
    return obj[fn(obj)]

filterobj(df.xs('A')['II'] - df.xs('B')['II'], lambda x: x < 0)

should work, unless I've missed something. Using lambdas this way is one of the usual tricks for delaying evaluation.
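As a side note, and as far as I know, pandas versions from 0.18.1 onward accept a callable directly as an indexer (in `.loc`, `.iloc` and `[]`), which covers this exact pattern without a helper. The sketch below uses a small made-up Series `s` (not the `df` from the question) and also shows that the lambda-based `filterobj` works on a plain numpy ndarray:

```python
import numpy as np
import pandas as pd

def filterobj(obj, fn):
    # Same helper as above: fn builds the boolean mask from obj itself.
    return obj[fn(obj)]

# Illustrative data only; any Series or DataFrame expression would do.
s = pd.Series([-1.5, 0.3, -0.7, 2.0], index=list("abcd"))

# The callable is invoked with the object being indexed,
# so the intermediate never needs a name.
neg_loc = (s - 1).loc[lambda x: x < 0]

# The helper gives the same result on the Series...
neg_helper = filterobj(s - 1, lambda x: x < 0)

# ...and also works on an ndarray, since boolean masks index both.
arr_neg = filterobj(np.array([3.0, -1.0, 2.0, -4.0]), lambda a: a < 0)
```

Either way the mask is computed by vectorized comparison, not by iterating in Python, so it should cost about the same as the explicit `tmp[tmp < 0]` version.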

Thinking out loud: one could make a this object which isn't evaluated but just sticks around as an expression, something like

>>> this
this
>>> this < 3
this < 3
>>> df[this < 3]
Traceback (most recent call last):
  File "<ipython-input-34-d5f1e0baecf9>", line 1, in <module>
    df[this < 3]
[...]
KeyError: u'no item named this < 3'

and then either special-case the treatment of this into pandas or still have a function like

def filterobj(obj, criterion):
    return obj[eval(str(criterion.subs({"this": "obj"})))]

(with enough work we could lose the eval, this is simply proof of concept) after which something like

>>> tmp = df["I"] + df["II"]
>>> tmp[tmp < 0]
Alpha  Int
A      4     -0.464487
B      3     -1.352535
       4     -1.678836
Dtype: float64
>>> filterobj(df["I"] + df["II"], this < 0)
Alpha  Int
A      4     -0.464487
B      3     -1.352535
       4     -1.678836
Dtype: float64

would work. I'm not sure any of this is worth the headache, though; Python simply isn't very conducive to this style.
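For what it's worth, here is a minimal sketch of how such a `this` object could be built without `eval`, using operator overloading to record the comparison and apply it later. The `Deferred` class and everything in it are hypothetical names made up for illustration; nothing like this ships with pandas, and only a few comparison operators are covered:

```python
import operator
import pandas as pd

class Deferred:
    """Hypothetical placeholder that records an expression for later use."""

    def __init__(self, fn=lambda obj: obj, text="this"):
        self._fn = fn      # how to compute the value from a real object
        self._text = text  # human-readable form of the expression

    def __call__(self, obj):
        # Evaluate the recorded expression against a concrete object.
        return self._fn(obj)

    def __repr__(self):
        return self._text

    def _binary(self, op, symbol, other):
        # Build a new Deferred that applies op once a real object arrives.
        return Deferred(lambda obj: op(self._fn(obj), other),
                        f"{self._text} {symbol} {other!r}")

    def __lt__(self, other):
        return self._binary(operator.lt, "<", other)

    def __le__(self, other):
        return self._binary(operator.le, "<=", other)

    def __gt__(self, other):
        return self._binary(operator.gt, ">", other)

    def __ge__(self, other):
        return self._binary(operator.ge, ">=", other)

def filterobj(obj, criterion):
    # criterion is a Deferred; calling it yields the boolean mask.
    return obj[criterion(obj)]

this = Deferred()
s = pd.Series([-1.0, 2.0, -3.0])  # illustrative data only
result = filterobj(s, this < 0)   # keeps the entries -1.0 and -3.0
```

With this, `this < 3` evaluates to an object whose repr is `this < 3`, much like the session above. A full version would need the arithmetic and logical operators as well, which is where the headache starts.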
