Masking a series with a boolean array

Question

This has given me a lot of trouble, and I am perplexed by the incompatibility of numpy arrays with pandas series. For instance, when I create a boolean array using a series:

import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])
delta = np.percentile(x, 50)  # median of x, 4.0
deltamask = x - y > delta     # boolean Series, all False

deltamask is a boolean pandas Series.

However, if you do

x[deltamask]
y[deltamask]

you find that the array completely ignores the mask. No error is raised, but you end up with two objects of different lengths. This means that an operation like

x[deltamask]*y[deltamask]

results in an error:

print(type(x - y))
print(type(x[deltamask]), len(x[deltamask]))
print(type(y[deltamask]), len(y[deltamask]))

Even more perplexing, I noticed that the operator < is treated differently. For instance

print(type(2 * x < x * y))
print(type(2 < x * y))

will give you a pd.Series and an np.array, respectively.

5 < x - y

results in a series, so it seems that the series takes precedence, whereas the boolean elements of a series mask are promoted to integers when passed to a numpy array and result in a sliced array.

What is the reason for this?

Recommended answer

Fancy indexing

As numpy currently stands, fancy indexing in numpy works as follows:

  1. If the thing between brackets is a tuple (whether with explicit parens or not), the elements of the tuple are indices for different dimensions of x. For example, both x[(True, True)] and x[True, True] will raise IndexError: too many indices for array in this case because x is 1D. However, before the exception happens, a telling warning will be issued too: VisibleDeprecationWarning: using a boolean instead of an integer will result in an error in the future.

  2. If the thing between brackets is exactly an ndarray, not a subclass or other array-like, and has a boolean type, it will be applied as a mask. This is why x[deltamask.values] gives the expected result (an empty array, since deltamask is all False).

  3. If the thing between brackets is any other array-like, whether an ndarray subclass, an object such as a Series, a plain list, or something else, it is converted to an np.intp array (if possible) and used as an integer index. So x[deltamask] yields something equivalent to x[[False] * 7], or just x[[0] * 7]. In this case, len(deltamask) == 7 and x[0] == 1, so the result is [1, 1, 1, 1, 1, 1, 1].
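Given the rules above, one way to sidestep the ambiguity is to hand NumPy a plain boolean ndarray rather than a Series. The following is a minimal sketch, not taken from the original answer: the names mask, plain_mask, x_sel and y_sel are illustrative, and the mask is built as y > median so that some entries are True (the deltamask from the question is all False and would simply select nothing).

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])

# Illustrative mask with some True entries: True where y > 4.0.
mask = y > np.percentile(x, 50)

# Convert the Series to a plain boolean ndarray before indexing,
# so NumPy receives exactly the kind of object it applies as a mask.
plain_mask = np.asarray(mask, dtype=bool)

x_sel = x[plain_mask]   # masked ndarray
y_sel = y[plain_mask]   # masked Series, same length as x_sel

# Both selections now line up, so elementwise arithmetic is safe.
product = x_sel * y_sel.values
```

This is the same idea as x[deltamask.values] in the rule for exact ndarrays; np.asarray with dtype=bool is used here only to make the conversion explicit.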

This behavior is counterintuitive, and the FutureWarning it generates (in the future, boolean array-likes will be handled as a boolean array index) indicates that a fix is in the works. I will update this answer as I find out about, or make, any changes to numpy.

This information can be found in Sebastian Berg's response to my initial query on Numpy discussion.

Relational operators

Now let's address the second part of your question about how the comparison works. Relational operators (<, >, <=, >=) work by calling the corresponding method on one of the objects being compared. For < this is __lt__. However, instead of just calling x.__lt__(y) for the expression x < y, Python actually checks the types of the objects being compared. If the type of y is a subclass of the type of x and implements the comparison, then Python prefers to call y.__gt__(x) instead, regardless of how you wrote the original comparison. In that case, the only way x.__lt__(y) will get called is if y.__gt__(x) returns NotImplemented to indicate that the comparison is not supported in that direction.
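The dispatch order described above can be observed with a few toy classes. This is only an illustrative sketch: Base, Derived, Shy and the calls list are hypothetical names invented for the example and have nothing to do with numpy or pandas internals.

```python
calls = []  # records which comparison methods Python actually invokes


class Base:
    def __lt__(self, other):
        calls.append("Base.__lt__")
        return True

    def __gt__(self, other):
        calls.append("Base.__gt__")
        return True


class Derived(Base):
    # Subclass that supports the reflected comparison.
    def __gt__(self, other):
        calls.append("Derived.__gt__")
        return True


class Shy(Base):
    # Subclass that declines the reflected comparison.
    def __gt__(self, other):
        calls.append("Shy.__gt__")
        return NotImplemented


# Right operand is an instance of a subclass: its __gt__ is tried first.
Base() < Derived()

# The subclass declines, so Python falls back to Base.__lt__.
Base() < Shy()
```

After running this, calls should read Derived.__gt__, then Shy.__gt__, then Base.__lt__, which mirrors the order of attempts described in the paragraph above.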

A similar thing happens when you do 5 < x - y. While ndarray is not a subclass of int, the comparison int.__lt__(ndarray) returns NotImplemented, so Python actually ends up calling (x - y).__gt__(5), which is of course defined and works just fine.
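The fallback can be checked directly by calling the methods by hand. A small sketch, using a plain ndarray named arr (an illustrative stand-in for x - y, chosen so the result is not all False):

```python
import numpy as np

arr = np.array([3, 6, 9])

# int does not know how to compare itself with an ndarray,
# so the direct method call reports NotImplemented.
direct = (5).__lt__(arr)

# The reflected method on the array handles the comparison elementwise.
reflected = arr.__gt__(5)

# The operator form goes through the same fallback.
via_operator = 5 < arr
```

Here direct is the NotImplemented singleton, while reflected and via_operator hold the same elementwise boolean result.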

A much more succinct explanation of all this can be found in the Python docs.
