Why does pandas merge on NaN?


Question

I recently asked a question regarding missing values in pandas and was directed to a GitHub issue. After reading through that page and the missing data documentation, I am wondering why merge and join treat NaNs as a match when "they don't compare equal": np.nan != np.nan

import numpy as np
import pandas as pd

# merge example
df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
df2 = pd.DataFrame({'col1': [np.nan, 'no match'], 'col3': [3, 4]})
pd.merge(df, df2, on='col1')

    col1    col2    col3
0   NaN      1       3

# join example with same dataframes from above
df.set_index('col1').join(df2.set_index('col1'))

      col2  col3
col1        
NaN     1   3.0
match   2   NaN

However, NaNs in groupby are excluded:

df = pd.DataFrame({'col1':[np.nan, 'match', np.nan], 'col2':[1,2,1]})
df.groupby('col1').sum()

       col2
col1    
match   2

Of course you can dropna() or df[df['col1'].notnull()] but I am curious as to why NaNs are excluded in some pandas operations like groupby and not others like merge, join, update, and map?
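As a minimal sketch of the dropna() workaround mentioned above, the snippet below reuses the two example frames and drops rows whose key is NaN before merging. The variable names merged_default and merged_no_nan are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
df2 = pd.DataFrame({'col1': [np.nan, 'no match'], 'col3': [3, 4]})

# NaN never compares equal to itself under ordinary comparison.
nan_equal = (np.nan == np.nan)

# Default behaviour: the two NaN keys are paired with each other.
merged_default = pd.merge(df, df2, on='col1')

# Workaround: drop rows with a missing key on both sides before merging.
merged_no_nan = pd.merge(
    df.dropna(subset=['col1']),
    df2.dropna(subset=['col1']),
    on='col1',
)

print(nan_equal)
print(merged_default)
print(merged_no_nan)
```

With these frames, no key other than NaN appears on both sides, so the second merge comes back empty.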

Essentially, as I asked above, why do merge and join match on np.nan when NaNs do not compare equal?

Recommended answer

Yeah, this is definitely a bug. See GH22491, which documents exactly your issue, and GH22618, which notes the problem is also observed with None. Based on the discussions, this does not appear to be intended behaviour.

A quick source dive shows that the issue *might* be inside the _factorize_keys function in pandas/core/reshape/merge.py. This function appears to factorise the keys to determine what rows are to be matched with each other.

Specifically, this part

# NA group
lmask = llab == -1
lany = lmask.any()
rmask = rlab == -1
rany = rmask.any()

if lany or rany:
    if lany:
        np.putmask(llab, lmask, count)
    if rany:
        np.putmask(rlab, rmask, count)
    count += 1

...seems to be the culprit. NaN keys are identified as a valid category (with categorical value equal to count).
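To illustrate the effect, here is a simplified sketch that imitates the relabelling step with the public pd.factorize function. It is not the actual _factorize_keys implementation; the names left_keys, right_keys, and combined are made up for this example, and factorizing the concatenated keys is an assumption made only to give both sides a shared set of labels.

```python
import numpy as np
import pandas as pd

left_keys = np.array([np.nan, 'match'], dtype=object)
right_keys = np.array([np.nan, 'no match'], dtype=object)

# Factorize both sides together so equal keys share a label.
# By default pd.factorize marks missing values with the sentinel -1.
combined = np.concatenate([left_keys, right_keys])
codes, uniques = pd.factorize(combined)
llab = codes[:len(left_keys)].copy()
rlab = codes[len(left_keys):].copy()
count = len(uniques)

# Same pattern as the quoted snippet: every -1 is replaced by one
# shared label, so missing keys on both sides end up in the same group.
lmask = llab == -1
rmask = rlab == -1
if lmask.any() or rmask.any():
    np.putmask(llab, lmask, count)
    np.putmask(rlab, rmask, count)
    count += 1

print(llab, rlab, count)
```

After the relabelling, the NaN on the left and the NaN on the right carry the same integer label, which is why a label-based join would pair them even though np.nan != np.nan.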

Disclaimer: I am not a pandas dev, and this is only my speculation; so the real issue could be something else. But from first glance, this seems like it.
