Why does pandas merge on NaN?
Question
I recently asked a question regarding missing values in pandas and was directed to a GitHub issue. After reading through that page and the missing data documentation, I am wondering why merge and join treat NaNs as a match when "they don't compare equal": np.nan != np.nan
import numpy as np
import pandas as pd

# merge example: the NaN key in df is matched with the NaN key in df2
df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
df2 = pd.DataFrame({'col1': [np.nan, 'no match'], 'col3': [3, 4]})
pd.merge(df, df2, on='col1')
#   col1  col2  col3
# 0  NaN     1     3

# join example with same dataframes from above
df.set_index('col1').join(df2.set_index('col1'))
#        col2  col3
# col1
# NaN       1   3.0
# match     2   NaN
However, NaNs in groupby are excluded:
df = pd.DataFrame({'col1': [np.nan, 'match', np.nan], 'col2': [1, 2, 1]})
df.groupby('col1').sum()
#        col2
# col1
# match     2
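For comparison, groupby can be asked to keep the NaN key as well. A minimal sketch, assuming pandas 1.1 or later (where groupby accepts a dropna argument); the variable names default_sums and with_nan_sums are only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match', np.nan], 'col2': [1, 2, 1]})

# Default behaviour (dropna=True): rows whose key is NaN are left out.
default_sums = df.groupby('col1')['col2'].sum()

# dropna=False keeps NaN as its own group (assumes pandas >= 1.1).
with_nan_sums = df.groupby('col1', dropna=False)['col2'].sum()

print(default_sums)
print(with_nan_sums)
```

So the exclusion in groupby is a default that can be switched off, not a hard rule.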
Of course you can use dropna() or df[df['col1'].notnull()], but I am curious as to why NaNs are excluded in some pandas operations like groupby and not in others like merge, join, update, and map?
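The dropna() workaround mentioned above could look roughly like this. It is only a sketch: the second frame is changed from the example so that it also contains 'match', which leaves one real match after the NaN keys are dropped, and the names merged_raw and merged_clean are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
# Assumption for illustration: the right frame also has a 'match' key.
df2 = pd.DataFrame({'col1': [np.nan, 'match'], 'col3': [3, 4]})

# Plain merge: the NaN keys are paired with each other.
merged_raw = pd.merge(df, df2, on='col1')

# Dropping rows with a missing key on both sides first
# leaves only the genuine match.
merged_clean = pd.merge(
    df.dropna(subset=['col1']),
    df2.dropna(subset=['col1']),
    on='col1',
)

print(merged_raw)
print(merged_clean)
```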
Essentially, as I asked above, why do merge and join match on np.nan when NaN values do not compare equal?
Recommended answer
Yeah, this is definitely a bug. See GH22491, which documents exactly your issue, and GH22618, which notes that the problem is also observed with None. Based on the discussions, this does not appear to be intended behaviour.
A quick source dive shows that the issue *might* be inside the _factorize_keys function in pandas/core/reshape/merge.py. This function appears to factorise the keys to determine which rows are to be matched with each other.
Specifically, this portion
# NA group
lmask = llab == -1
lany = lmask.any()
rmask = rlab == -1
rany = rmask.any()

if lany or rany:
    if lany:
        np.putmask(llab, lmask, count)
    if rany:
        np.putmask(rlab, rmask, count)
    count += 1
...seems to be the culprit. NaN keys are identified as a valid category (with categorical value equal to count).
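To make the suspected mechanism concrete, here is a small stand-alone sketch. It is not pandas' internal code path: it uses the public pd.factorize on the concatenated keys as a stand-in for the shared factorisation, then repeats the relabelling step quoted above. The names left_keys and right_keys are illustrative.

```python
import numpy as np
import pandas as pd

left_keys = np.array([np.nan, 'match'], dtype=object)
right_keys = np.array([np.nan, 'no match'], dtype=object)

# Factorise both sides against one shared set of uniques.
# pd.factorize marks missing values with the sentinel code -1.
codes, uniques = pd.factorize(np.concatenate([left_keys, right_keys]))
llab = codes[:len(left_keys)].copy()
rlab = codes[len(left_keys):].copy()
count = len(uniques)

# Same idea as the quoted "NA group" step: every -1 is rewritten
# to one ordinary label, so NaN on the left and NaN on the right
# end up sharing that label.
lmask = llab == -1
rmask = rlab == -1
if lmask.any() or rmask.any():
    np.putmask(llab, lmask, count)
    np.putmask(rlab, rmask, count)
    count += 1

print(llab, rlab, count)
```

Once both NaN keys carry the same integer label, a join on the labels has no way to tell them apart from any other pair of equal keys, which would explain the matched NaN row.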
Disclaimer: I am not a pandas dev, and this is only my speculation, so the real issue could be something else. But at first glance, this seems to be it.