On the default/fill value for outer joins

Question

Below are tiny toy versions of the much larger and more complex dataframes I'm working with:

>>> A
  key         u         v         w         x
0   a  0.757954  0.258917  0.404934  0.303313
1   b  0.583382  0.504687       NaN  0.618369
2   c       NaN  0.982785  0.902166       NaN
3   d  0.898838  0.472143       NaN  0.610887
4   e  0.966606  0.865310       NaN  0.548699
5   f       NaN  0.398824  0.668153       NaN

>>> B
  key         y         z
0   a  0.867603       NaN
1   b       NaN  0.191067
2   c  0.238616  0.803179
3   p  0.080446       NaN
4   q  0.932834       NaN
5   r  0.706561  0.814467

(FWIW, at the end of this post, I provide code to generate these dataframes.)

I want to produce an outer join of these dataframes on the key column [1], in such a way that the new positions induced by the outer join get the default value 0.0. In other words, the desired result looks like this:

  key         u         v         w         x         y         z
0   a  0.757954  0.258917  0.404934  0.303313  0.867603       NaN
1   b  0.583382  0.504687       NaN  0.618369       NaN  0.191067
2   c       NaN  0.982785  0.902166       NaN  0.238616  0.803179
3   d  0.898838  0.472143       NaN  0.610887  0.000000  0.000000
4   e  0.966606  0.865310       NaN  0.548699  0.000000  0.000000
5   f       NaN  0.398824  0.668153       NaN  0.000000  0.000000
6   p  0.000000  0.000000  0.000000  0.000000  0.080446       NaN
7   q  0.000000  0.000000  0.000000  0.000000  0.932834       NaN
8   r  0.000000  0.000000  0.000000  0.000000  0.706561  0.814467

(Note that this desired output contains some NaNs, namely those that were already present in A or B.)

The merge method gets me part of the way there, but the filled-in default values are NaNs, not 0.0s:

>>> C = pandas.DataFrame.merge(A, B, how='outer', on='key')
>>> C
  key         u         v         w         x         y         z
0   a  0.757954  0.258917  0.404934  0.303313  0.867603       NaN
1   b  0.583382  0.504687       NaN  0.618369       NaN  0.191067
2   c       NaN  0.982785  0.902166       NaN  0.238616  0.803179
3   d  0.898838  0.472143       NaN  0.610887       NaN       NaN
4   e  0.966606  0.865310       NaN  0.548699       NaN       NaN
5   f       NaN  0.398824  0.668153       NaN       NaN       NaN
6   p       NaN       NaN       NaN       NaN  0.080446       NaN
7   q       NaN       NaN       NaN       NaN  0.932834       NaN
8   r       NaN       NaN       NaN       NaN  0.706561  0.814467

The fillna method fails to produce the desired output, because it also modifies positions that should be left unchanged (the NaNs that were already present in A or B):

>>> C.fillna(0.0)
  key         u         v         w         x         y         z
0   a  0.757954  0.258917  0.404934  0.303313  0.867603  0.000000
1   b  0.583382  0.504687  0.000000  0.618369  0.000000  0.191067
2   c  0.000000  0.982785  0.902166  0.000000  0.238616  0.803179
3   d  0.898838  0.472143  0.000000  0.610887  0.000000  0.000000
4   e  0.966606  0.865310  0.000000  0.548699  0.000000  0.000000
5   f  0.000000  0.398824  0.668153  0.000000  0.000000  0.000000
6   p  0.000000  0.000000  0.000000  0.000000  0.080446  0.000000
7   q  0.000000  0.000000  0.000000  0.000000  0.932834  0.000000
8   r  0.000000  0.000000  0.000000  0.000000  0.706561  0.814467

How can I achieve the desired output efficiently? (Performance matters here, because I intend to perform this operation on much larger dataframes than those shown here.)

FWIW, below is the code to generate the example dataframes A and B.

from pandas import DataFrame
from collections import OrderedDict
from random import random, seed

def make_dataframe(rows, colnames):
    return DataFrame(OrderedDict([(n, [row[i] for row in rows])
                                 for i, n in enumerate(colnames)]))

maybe_nan = lambda: float('nan') if random() < 0.4 else random()

seed(0)

A = make_dataframe([['a', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['b', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['c', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['d', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['e', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['f', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()]],
                   ('key', 'u', 'v', 'w', 'x'))

B = make_dataframe([['a', maybe_nan(), maybe_nan()],
                    ['b', maybe_nan(), maybe_nan()],
                    ['c', maybe_nan(), maybe_nan()],
                    ['p', maybe_nan(), maybe_nan()],
                    ['q', maybe_nan(), maybe_nan()],
                    ['r', maybe_nan(), maybe_nan()]],
                   ('key', 'y', 'z'))


[1] For the case of multi-key outer joins, see here.

Recommended answer

You can fill in the zeros after the merge:

import pandas as pd
# Rows whose key does not occur in A get 0 in every column that came from A.
res = pd.merge(A, B, how="outer")
res.loc[~res.key.isin(A.key), A.columns] = 0

Edit

The line above also overwrites the key column in those rows. To skip the key column:

res.loc[~res.key.isin(A.key), A.columns.drop("key")] = 0
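
Note that the snippets above only zero A's columns in rows whose key is missing from A. The desired output in the question also needs y and z set to 0.0 for the rows whose key is missing from B (d, e and f). One way to cover both sides is sketched below, using the indicator=True option of pandas.merge, which adds a _merge column marking each row as left_only, right_only or both. The function name outer_join_fill and the small example frames are made up for illustration, and the sketch assumes the two frames share no column names other than the join key (otherwise merge would add suffixes and the column lookup would fail).

```python
import numpy as np
import pandas as pd


def outer_join_fill(left, right, on, fill_value=0.0):
    """Outer-join two frames and fill only the cells created by the join.

    Hypothetical helper; assumes `left` and `right` have no column names
    in common apart from the key column(s) given in `on`.
    """
    keys = [on] if isinstance(on, str) else list(on)
    left_cols = [c for c in left.columns if c not in keys]
    right_cols = [c for c in right.columns if c not in keys]

    # indicator=True adds a "_merge" column: left_only / right_only / both.
    res = pd.merge(left, right, how="outer", on=keys, indicator=True)

    # Rows that exist only in `right` have no data for the left-hand columns,
    # and rows that exist only in `left` have none for the right-hand columns.
    res.loc[res["_merge"] == "right_only", left_cols] = fill_value
    res.loc[res["_merge"] == "left_only", right_cols] = fill_value

    # NaNs in rows marked "both" were already present in the inputs; keep them.
    return res.drop(columns="_merge")


# Small illustrative frames (not the A and B from the question).
A_small = pd.DataFrame({"key": ["a", "b", "d"], "u": [1.0, np.nan, 3.0]})
B_small = pd.DataFrame({"key": ["a", "b", "p"], "y": [np.nan, 5.0, 6.0]})
out = outer_join_fill(A_small, B_small, on="key").set_index("key")
print(out)
```

Because the mask comes from the merge itself rather than from a separate isin lookup on one key column, the same sketch should also apply when on is a list of several key columns. Whether it is faster than the isin approach on large frames is something to measure on the real data.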
