On the default/fill value for outer joins
Question
Below are teeny/toy versions of much larger/complex dataframes I'm working with:
>>> A
key u v w x
0 a 0.757954 0.258917 0.404934 0.303313
1 b 0.583382 0.504687 NaN 0.618369
2 c NaN 0.982785 0.902166 NaN
3 d 0.898838 0.472143 NaN 0.610887
4 e 0.966606 0.865310 NaN 0.548699
5 f NaN 0.398824 0.668153 NaN
>>> B
key y z
0 a 0.867603 NaN
1 b NaN 0.191067
2 c 0.238616 0.803179
3 p 0.080446 NaN
4 q 0.932834 NaN
5 r 0.706561 0.814467
(FWIW, at the end of this post, I provide code to generate these dataframes.)
I want to produce an outer join of these dataframes on the key column [1], in such a way that the new positions induced by the outer join get the default value 0.0. IOW, the desired result looks like this:
key u v w x y z
0 a 0.757954 0.258917 0.404934 0.303313 0.867603 NaN
1 b 0.583382 0.504687 NaN 0.618369 NaN 0.191067
2 c NaN 0.982785 0.902166 NaN 0.238616 0.803179
3 d 0.898838 0.472143 NaN 0.610887 0.000000 0.000000
4 e 0.966606 0.86531 NaN 0.548699 0.000000 0.000000
5 f NaN 0.398824 0.668153 NaN 0.000000 0.000000
6 p 0.000000 0.000000 0.000000 0.000000 0.080446 NaN
7 q 0.000000 0.000000 0.000000 0.000000 0.932834 NaN
8 r 0.000000 0.000000 0.000000 0.000000 0.706561 0.814467
(Note that this desired output still contains some NaNs, namely those that were already present in A or B.)
The merge method gets me part-way there, but the filled-in default values are NaNs, not 0.0s:
>>> C = pandas.DataFrame.merge(A, B, how='outer', on='key')
>>> C
key u v w x y z
0 a 0.757954 0.258917 0.404934 0.303313 0.867603 NaN
1 b 0.583382 0.504687 NaN 0.618369 NaN 0.191067
2 c NaN 0.982785 0.902166 NaN 0.238616 0.803179
3 d 0.898838 0.472143 NaN 0.610887 NaN NaN
4 e 0.966606 0.865310 NaN 0.548699 NaN NaN
5 f NaN 0.398824 0.668153 NaN NaN NaN
6 p NaN NaN NaN NaN 0.080446 NaN
7 q NaN NaN NaN NaN 0.932834 NaN
8 r NaN NaN NaN NaN 0.706561 0.814467
The fillna method fails to produce the desired output, because it also overwrites the NaNs that were already present in A or B, which should be left unchanged:
>>> C.fillna(0.0)
key u v w x y z
0 a 0.757954 0.258917 0.404934 0.303313 0.867603 0.000000
1 b 0.583382 0.504687 0.000000 0.618369 0.000000 0.191067
2 c 0.000000 0.982785 0.902166 0.000000 0.238616 0.803179
3 d 0.898838 0.472143 0.000000 0.610887 0.000000 0.000000
4 e 0.966606 0.865310 0.000000 0.548699 0.000000 0.000000
5 f 0.000000 0.398824 0.668153 0.000000 0.000000 0.000000
6 p 0.000000 0.000000 0.000000 0.000000 0.080446 0.000000
7 q 0.000000 0.000000 0.000000 0.000000 0.932834 0.000000
8 r 0.000000 0.000000 0.000000 0.000000 0.706561 0.814467
How can I achieve the desired output efficiently? (Performance matters here, because I intend to perform this operation on much larger dataframes than those shown here.)
FWIW, below is the code to generate the example dataframes A and B.
from pandas import DataFrame
from collections import OrderedDict
from random import random, seed

def make_dataframe(rows, colnames):
    return DataFrame(OrderedDict([(n, [row[i] for row in rows])
                                  for i, n in enumerate(colnames)]))

maybe_nan = lambda: float('nan') if random() < 0.4 else random()

seed(0)
A = make_dataframe([['a', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['b', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['c', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['d', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['e', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()],
                    ['f', maybe_nan(), maybe_nan(), maybe_nan(), maybe_nan()]],
                   ('key', 'u', 'v', 'w', 'x'))
B = make_dataframe([['a', maybe_nan(), maybe_nan()],
                    ['b', maybe_nan(), maybe_nan()],
                    ['c', maybe_nan(), maybe_nan()],
                    ['p', maybe_nan(), maybe_nan()],
                    ['q', maybe_nan(), maybe_nan()],
                    ['r', maybe_nan(), maybe_nan()]],
                   ('key', 'y', 'z'))
[1] For the case of multi-key outer joins, see here.
Recommended answer
You can fill in the zeros after the merge:
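import pandas as pd

res = pd.merge(A, B, how="outer")
# Zero out A's columns in the rows whose key does not occur in A
# (note that A.columns includes "key", so the key is overwritten too).
res.loc[~res.key.isin(A.key), A.columns] = 0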
Edit
To skip the key column:
res.loc[~res.key.isin(A.key), A.columns.drop("key")] = 0
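The snippet above only zeroes A's columns for keys missing from A; the desired output in the question also needs B's columns zeroed for keys missing from B. One way to cover both sides is sketched below using the indicator option of merge, which records for each row whether it came from the left frame, the right frame, or both. This is a sketch, not the answerer's code: the helper name outer_join_fill and the tiny frames are made up for illustration, and it assumes a single key column, no other column names shared between the two frames, and no existing column called "_merge".

```python
import numpy as np
import pandas as pd


def outer_join_fill(left, right, key="key", fill=0.0):
    """Outer-join on `key`, filling only the cells the join itself created.

    Hypothetical helper: assumes `left` and `right` share no column
    names other than `key`, and that neither has a "_merge" column.
    """
    # indicator=True adds a "_merge" column with the values
    # "left_only", "right_only" or "both" for every output row.
    res = pd.merge(left, right, how="outer", on=key, indicator=True)
    left_cols = left.columns.drop(key)
    right_cols = right.columns.drop(key)
    # Rows present only in `right` have no data for the left-hand columns.
    res.loc[res["_merge"] == "right_only", left_cols] = fill
    # Rows present only in `left` have no data for the right-hand columns.
    res.loc[res["_merge"] == "left_only", right_cols] = fill
    return res.drop(columns="_merge")


# Small illustrative frames (not the A and B from the question).
A = pd.DataFrame({"key": ["a", "b", "c"], "u": [1.0, np.nan, 3.0]})
B = pd.DataFrame({"key": ["a", "b", "p"], "y": [np.nan, 5.0, 6.0]})
C = outer_join_fill(A, B)
print(C)
```

NaNs that were already in A or B sit in rows marked "both" or in the frame's own columns, so neither assignment touches them. Both assignments are single boolean-mask writes, so this should scale reasonably to larger frames, though I have not benchmarked it.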