Why isn't pandas logical operator aligning on the index like it should?

Question

Consider the following simple setup:

x = pd.Series([1, 2, 3], index=list('abc'))
y = pd.Series([2, 3, 3], index=list('bca'))

x

a    1
b    2
c    3
dtype: int64

y

b    2
c    3
a    3
dtype: int64

As you can see, the indexes are the same, just in a different order.

Now, consider a simple logical comparison using the equality (==) operator:

x == y
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

This raises a ValueError, most likely because the indexes do not match. On the other hand, calling the equivalent eq method works:

x.eq(y)

a    False
b     True
c     True
dtype: bool

OTOH, the operator method works given y is first reordered...

x == y.reindex_like(x)

a    False
b     True
c     True
dtype: bool

My understanding was that the function and operator comparison should do the same thing, all other things equal. What is eq doing that the operator comparison doesn't?

Recommended answer

Viewing the whole traceback for a Series comparison with mismatched indexes, particularly focusing on the exception message:

In [1]: import pandas as pd
In [2]: x = pd.Series([1, 2, 3], index=list('abc'))
In [3]: y = pd.Series([2, 3, 3], index=list('bca'))
In [4]: x == y
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-73b2790c1e5e> in <module>()
----> 1 x == y
/usr/lib/python3.7/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
   1188 
   1189         elif isinstance(other, ABCSeries) and not self._indexed_same(other):
-> 1190             raise ValueError("Can only compare identically-labeled "
   1191                              "Series objects")
   1192 
ValueError: Can only compare identically-labeled Series objects

we see that this is a deliberate implementation decision. Also, this is not unique to Series objects - DataFrames raise a similar error.
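As a minimal sketch of the DataFrame case, the snippet below rebuilds the question's data as two single-column frames. The column name `v` and the variable names are illustrative assumptions, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

# Same labels on both sides, but in a different order (column name 'v' is illustrative).
df1 = pd.DataFrame({'v': [1, 2, 3]}, index=list('abc'))
df2 = pd.DataFrame({'v': [2, 3, 3]}, index=list('bca'))

try:
    df1 == df2          # the operator refuses differently-labeled frames
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# The flexible method aligns on the labels before comparing.
aligned = df1.eq(df2)
print(aligned)
```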

Digging through the Git blame for the relevant lines eventually turns up some relevant commits and issue tracker threads. For example, Series.__eq__ used to completely ignore the RHS's index, and in a comment on a bug report about that behavior, Pandas author Wes McKinney says the following:

This is actually a feature / deliberate choice and not a bug-- it's related to #652. Back in January I changed the comparison methods to do auto-alignment, but found that it led to a large amount of bugs / breakage for users and, in particular, many NumPy functions (which regularly do things like arr[1:] == arr[:-1]; example: np.unique) stopped working.

This gets back to the issue that Series isn't quite ndarray-like enough and should probably not be a subclass of ndarray.

So, I haven't got a good answer for you except for that; auto-alignment would be ideal but I don't think I can do it unless I make Series not a subclass of ndarray. I think this is probably a good idea but not likely to happen until 0.9 or 0.10 (several months down the road).
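To illustrate why the `arr[1:] == arr[:-1]` idiom from the quote is incompatible with label alignment, here is a small sketch. The sample data is made up, and `Series.eq` stands in for what an auto-aligning `==` would have done. That stand-in is an assumption for illustration, not a reconstruction of the old pandas code.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3])   # illustrative data with the default integer index

# Positional comparison, as NumPy code expects: element i+1 versus element i.
values = s.to_numpy()
positional = values[1:] == values[:-1]
print(positional)

# Label-aligned comparison: the two slices keep their original labels
# (1..4 and 0..3), so after alignment each element is compared with itself
# and the unmatched end labels compare as False.
aligned = s.iloc[1:].eq(s.iloc[:-1])
print(aligned)
```

The positional result answers "is this element equal to its predecessor?", while the aligned result does not, which is the kind of breakage the quote describes.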

This was then changed to the current behavior in pandas 0.19.0. Quoting the "what's new" page:

Following Series operators have been changed to make all operators consistent, including DataFrame (GH1134, GH4581, GH13538)

  • Series comparison operators now raise ValueError when index are different.
  • Series logical operators align both index of left and right hand side.

This made the Series behavior match that of DataFrame, which already rejected mismatched indices in comparisons.
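The two rules from the release note can be seen side by side in a short sketch. The boolean data and variable names are invented for illustration, and the behavior shown is what the release note describes for pandas 0.19.0 and later; other versions may differ.

```python
import pandas as pd

# Same labels, different order (illustrative data).
left = pd.Series([True, False, True], index=list('abc'))
right = pd.Series([True, True, False], index=list('bca'))

# Logical operators align on the labels before combining:
# a: True & False, b: False & True, c: True & True
both = left & right
print(both)

# Comparison operators refuse the same pair of operands.
try:
    left == right
    comparison_raised = False
except ValueError:
    comparison_raised = True
print(comparison_raised)
```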

In summary, making the comparison operators align indices automatically turned out to break too much stuff, so this was the best alternative.
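In practice this means alignment has to be requested explicitly. The sketch below reuses the question's `x` and `y` and shows three ways to do so; the variable names are illustrative, and the `Series.align` route is one option among several rather than a recommendation from the answer.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=list('abc'))
y = pd.Series([2, 3, 3], index=list('bca'))

# 1. The flexible method aligns on the index for you.
via_method = x.eq(y)

# 2. Reorder one operand so both indexes are identical, then use ==.
via_reindex = x == y.reindex_like(x)

# 3. Align both operands explicitly, then use ==.
x_aligned, y_aligned = x.align(y)
via_align = x_aligned == y_aligned

print(via_method)
```

All three compare values label by label; which one reads best depends on whether the labels are expected to match exactly.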
