pandas isin() returns a different result than eq(): floating dtype dependency issue

Question

The pandas isin method seems to depend on the dtype (using Python 3.5 with pandas 0.19.2). I came across this by accident in a related topic where we could not explain why isin was not working. Here is the example:

import pandas as pd

df = pd.DataFrame([[1.2, 0.3, 1.5, 1.4, 1.7, 4.2]])
print(df)

    0       1       2       3       4       5
0   1.2     0.3     1.5     1.4     1.7     4.2

print(df.dtypes)
0    float64
1    float64
2    float64
3    float64
4    float64
5    float64
dtype: object

# everything works as expected until here
print(df.isin([1.2, 1.4]))

      0      1      2     3      4      5
0  True  False  False  True  False  False

However, when the dtype is cast to float32, isin starts to fail:

df = df.apply(lambda x: x.astype("float32"))
print(df.dtypes)

0    float32
1    float32
2    float32
3    float32
4    float32
5    float32
dtype: object

print(df.isin([1.2, 1.4]))
       0      1      2      3      4      5
0  False  False  False  False  False  False

There is a similar post on SO.

I understand the floating point complication. However, from the perspective of a user who wants to employ isin as a convenience function for (col1 == 1) | (col1 == 3) | (col1 == 5) (simply writing col1.isin([1, 3, 5])), it may cause unnoticed errors when the dtypes differ, since no warning is given about the dtype mismatch.
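
The intended equivalence can be sketched on integer data, where no rounding is involved. The column name col1 and the sample values are made up for illustration. Note that the chained comparison needs parentheses, because | binds more tightly than == in Python.

```python
import pandas as pd

# Hypothetical integer column, chosen only for illustration.
col1 = pd.Series([1, 2, 3, 4, 5], name="col1")

# Explicit chain of equality tests; the parentheses are required because
# `|` has higher precedence than `==`.
chained = (col1 == 1) | (col1 == 3) | (col1 == 5)

# The shorthand the question is about.
shorthand = col1.isin([1, 3, 5])

# With exact integer values both forms should select the same rows.
print(chained.equals(shorthand))
```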

What is more, isin returns different results than df.eq:

print(df.isin([1.2]))

       0      1      2      3      4      5
0  False  False  False  False  False  False

print(df.eq(1.2))

      0      1      2      3      4      5
0  True  False  False  False  False  False

This is definitely unwanted behavior. As JohnE pointed out, it seems that df.eq uses np.isclose, whereas df.isin does not.

Recommended answer

Maybe this will make it clearer:

>>> '%20.18f' % df[0].astype(np.float64)
'1.199999999999999956'

>>> '%20.18f' % df[0].astype(np.float32)
'1.200000047683715820'

Generally you don't want to see 18 decimal places, so pandas makes reasonable choices about how many decimals to display, but the difference is still there, albeit invisibly. So you need to make sure to compare float64 to float64 and float32 to float32. That's the floating point life we have chosen for ourselves...
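
One way to follow that advice with isin, as a minimal sketch: cast the lookup values to the same dtype as the frame before calling it. The names df32 and lookup are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Same data as in the question, stored as float32.
df32 = pd.DataFrame([[1.2, 0.3, 1.5, 1.4, 1.7, 4.2]]).astype("float32")

# Cast the lookup values to float32 so both sides carry the same rounding.
lookup = np.array([1.2, 1.4], dtype=np.float32)

matched = df32.isin(lookup)
print(matched)
```

With matching dtypes, columns 0 and 3 should be flagged again, as in the float64 case.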

Alternatively, if you are comparing against the values one at a time, you could use np.isclose (after import numpy as np) to identify approximate equality:

>>> np.isclose( df.astype(np.float64), 1.2 )
array([[ True, False, False, False, False, False]], dtype=bool)

>>> np.isclose( df.astype(np.float32), 1.2 )
array([[ True, False, False, False, False, False]], dtype=bool)

(You don't need the astype(), of course; it's just there to show that you get the same answer for both float32 and float64.)
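
As for why np.isclose gives the same answer for both dtypes: by default it accepts a difference of up to atol + rtol * abs(b), with documented defaults rtol=1e-05 and atol=1e-08. A small sketch comparing the float32 rounding error of 1.2 with that tolerance (the variable names are illustrative):

```python
import numpy as np

target = 1.2
as_float32 = np.float32(target)

# Rounding error introduced by storing 1.2 as float32.
error = abs(float(as_float32) - target)

# Default tolerance of np.isclose(a, b): atol + rtol * abs(b).
tolerance = 1e-08 + 1e-05 * abs(target)

# The rounding error is far smaller than the tolerance,
# so the two values count as "close".
print(error < tolerance)
print(bool(np.isclose(as_float32, target)))
```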

I don't know whether there is a way to make isin work in a comparable way, so you may have to do something like:

>>> np.isclose( df, 1.2 ) | np.isclose( df, 1.4 )
array([[ True, False, False,  True, False, False]], dtype=bool)
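
For more than a couple of values, chaining | by hand gets tedious. A minimal sketch of a tolerance-based stand-in for isin follows. The helper isin_close is hypothetical (it is not a pandas or NumPy function), and it assumes an all-numeric frame.

```python
import numpy as np
import pandas as pd


def isin_close(frame, values, rtol=1e-05, atol=1e-08):
    """Hypothetical helper: element-wise 'isin' using np.isclose tolerances."""
    data = frame.to_numpy(dtype=np.float64)
    targets = np.asarray(values, dtype=np.float64)
    # Add a trailing axis so every cell is compared against every target,
    # then collapse that axis: a cell matches if it is close to any target.
    hits = np.isclose(data[..., np.newaxis], targets, rtol=rtol, atol=atol).any(axis=-1)
    return pd.DataFrame(hits, index=frame.index, columns=frame.columns)


df32 = pd.DataFrame([[1.2, 0.3, 1.5, 1.4, 1.7, 4.2]]).astype("float32")
print(isin_close(df32, [1.2, 1.4]))
```

Broadcasting builds an intermediate array with one entry per (cell, target) pair, so this suits short value lists better than very long ones.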
