Masking a series with a boolean array
Problem description
This has given me a lot of trouble, and I am perplexed by the incompatibility of numpy arrays with pandas series. For instance, when I create a boolean array using a series:
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])
delta = np.percentile(x, 50)
deltamask = x - y > delta
deltamask is a boolean pandas Series.
However, if you do
x[deltamask]
y[deltamask]
you find that the array completely ignores the mask. No error is raised, but you end up with two objects of different lengths. This means that an operation like
x[deltamask]*y[deltamask]
results in an error:
print(type(x - y))
print(type(x[deltamask]), len(x[deltamask]))
print(type(y[deltamask]), len(y[deltamask]))
Even more perplexing, I noticed that the operator < is treated differently. For instance:
print(type(2 * x < x * y))
print(type(2 < x * y))
will give you a pd.Series and an np.ndarray, respectively.
Also,
5 < x - y
results in a Series, so it seems that the Series takes precedence, whereas the boolean elements of a Series mask are promoted to integers when passed to a numpy array and result in a sliced array.
What is the reason for this?
Recommended answer
Fancy indexing
As numpy currently stands, fancy indexing in numpy works as follows:
- If the thing between brackets is a tuple (whether with explicit parens or not), the elements of the tuple are indices for different dimensions of x. For example, both x[(True, True)] and x[True, True] will raise IndexError: too many indices for array in this case because x is 1D. However, before the exception happens, a telling warning is issued too: VisibleDeprecationWarning: using a boolean instead of an integer will result in an error in the future.
- If the thing between brackets is exactly an ndarray, not a subclass or other array-like, and has a boolean type, it will be applied as a mask. This is why x[deltamask.values] gives the expected result (an empty array, since deltamask is all False).
- If the thing between brackets is any other array-like, whether a subclass like Series, just a list, or something else, it is converted to an np.intp array (if possible) and used as an integer index. So x[deltamask] yields something equivalent to x[[False] * 7] or just x[[0] * 7]. In this case, len(deltamask)==7 and x[0]==1, so the result is [1, 1, 1, 1, 1, 1, 1].
This behavior is counterintuitive, and the warning it generates (FutureWarning: in the future, boolean array-likes will be handled as a boolean array index) indicates that a fix is in the works. I will update this answer as I find out about or make any changes to numpy.
This information can be found in Sebastian Berg's response to my initial query on Numpy discussion.
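The three indexing cases listed above can be sketched with plain NumPy. This is a minimal illustration, not a reproduction of the legacy behavior: it assumes a reasonably recent NumPy and only uses explicit conversions, which should behave the same regardless of how a given version treats boolean array-likes. The variable names (all_false, int_index, and so on) are chosen for the example.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])

# Case 1: a tuple supplies one index per dimension, so two indices
# are too many for a 1D array.
try:
    x[0, 0]
    tuple_error = None
except IndexError as exc:
    tuple_error = str(exc)

# Case 2: an actual boolean ndarray is applied as a mask.
all_false = np.asarray([False] * 7, dtype=bool)
masked_empty = x[all_false]   # nothing selected
masked_some = x[x > 4]        # elements where the mask is True

# Case 3: an integer index array picks elements by position.
# Seven zeros select x[0] seven times, which is what the answer says
# x[deltamask] was equivalent to under the old conversion to np.intp.
int_index = np.zeros(7, dtype=np.intp)
repeated = x[int_index]

print(tuple_error)
print(masked_empty, masked_some, repeated)
```

For a pandas Series mask such as deltamask, passing deltamask.values (as the answer does) or np.asarray(deltamask, dtype=bool) makes the intent explicit and avoids depending on how array-likes are interpreted.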
Relational operators
Now let's address the second part of your question, about how the comparison works. Relational operators (<, >, <=, >=) work by calling the corresponding method on one of the objects being compared. For < this is __lt__. However, instead of just calling x.__lt__(y) for the expression x < y, Python first checks the types of the objects being compared. If the type of y is a subclass of the type of x and implements the comparison, then Python prefers to call y.__gt__(x) instead, regardless of how you wrote the original comparison. In that case, x.__lt__(y) is called only if y.__gt__(x) returns NotImplemented to indicate that the comparison is not supported in that direction.
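The subclass priority rule can be observed with a few toy classes. This is a small sketch for Python 3; Base, Derived, and Quiet are hypothetical names invented for the illustration and have nothing to do with numpy or pandas.

```python
class Base:
    def __lt__(self, other):
        return "Base.__lt__"

    def __gt__(self, other):
        return "Base.__gt__"


class Derived(Base):
    # Subclass of Base that implements the reflected comparison.
    def __gt__(self, other):
        return "Derived.__gt__"


class Quiet(Base):
    # Subclass that declines the reflected comparison.
    def __gt__(self, other):
        return NotImplemented


# The right operand is an instance of a subclass of the left operand's
# type, so Python tries Derived.__gt__ before Base.__lt__.
r_sub = Base() < Derived()

# Quiet.__gt__ returns NotImplemented, so Python falls back to Base.__lt__.
r_fallback = Base() < Quiet()

print(r_sub)
print(r_fallback)
```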
A similar thing happens when you do 5 < x - y. While ndarray is not a subclass of int, the comparison int.__lt__(ndarray) returns NotImplemented, so Python actually ends up calling (x - y).__gt__(5), which is of course defined and works just fine.
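The same fallback can be reproduced without numpy. In the sketch below, Wrapper is a hypothetical stand-in for an array-like type that defines __gt__; it is not part of any library.

```python
class Wrapper:
    """Hypothetical stand-in for an array-like object that defines __gt__."""

    def __gt__(self, other):
        return "Wrapper.__gt__ called with {!r}".format(other)


w = Wrapper()

# int does not know how to compare itself with a Wrapper ...
direct = (5).__lt__(w)

# ... so for 5 < w Python falls back to the reflected method w.__gt__(5).
result = 5 < w

print(direct is NotImplemented)
print(result)
```

Whatever the right-hand operand returns from __gt__ becomes the value of the whole expression, which is why the result type follows the array or Series rather than the integer.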
A much more succinct explanation of all this can be found in the Python docs.