pandas 合并101 [英] Pandas Merging 101
问题描述
- 如何执行与熊猫的(
LEFT
|RIGHT
|FULL
)(INNER
|OUTER
)连接? - 如何在合并后为缺失的行添加NaN?
- 合并后如何去除NaN?
- 我可以合并索引吗?
- 与熊猫交叉连接吗?
- 如何合并多个DataFrame?
-
merge
?join
?concat
?update
?谁?什么?为什么?!
...等等.我已经看到这些重复出现的问题,询问有关熊猫合并功能的各个方面.如今,有关合并及其各种用例的大多数信息都分散在数十个措辞不好,无法搜索的帖子中.这里的目的是整理一些关于后代的更重要的观点.
此QnA旨在作为有关常见熊猫习语的一系列有用的用户指南的下一部分(请参阅这篇有关串联的帖子,我将在稍后进行讨论).
请注意,该帖子不是 ,它是文档,因此也请阅读!一些示例是从那里获取的.
这篇文章旨在为读者提供有关SQL风格与大熊猫的合并,使用方法以及何时不使用它的入门知识.
特别是,这是这篇文章的内容:
-
基本知识-联接类型(左,右,外,内)
- 合并不同的列名
- 避免在输出中出现重复的合并键列
- 在不同条件下合并索引
- 有效使用您的命名索引
- 将键合并为一个键的索引,并将另一个键作为列的索引
- Multiway合并列和索引(唯一和非唯一)
-
merge
和join
的显着替代方案
该帖子不会涉及的内容:
- 与性能有关的讨论和时间安排(目前).在适当的地方,最引人注目的是提到更好的替代方案.
- 处理后缀,删除多余的列,重命名输出以及其他特定用例.还有其他(阅读:更好)的文章可以解决这个问题,所以请弄清楚!
注意
除非另有说明,否则大多数示例在演示各种功能时会默认使用INNER JOIN操作.此外,此处的所有DataFrame都可以复制和复制,因此 你可以和他们一起玩.另请参阅此 发布 关于如何从剪贴板读取DataFrame的信息.
最后,所有JOIN操作的视觉表示都是使用Google绘图手工绘制的.来自此处的启发.
足够多的谈话,请教我如何使用merge
!
设置
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
为简单起见,key列具有相同的名称(目前).
内部联接由
表示
注意
这以及即将出现的数字均遵循以下惯例:
- 蓝色表示合并结果中存在的行
- 红色表示从结果中排除(即已删除)的行
- 绿色表示结果中缺失的值被NaN取代
要执行内部联接,请调用 pd.merge
指定左侧的DataFrame,右侧的DataFrame和连接键.
pd.merge(left, right, on='key')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
这仅返回来自left
和right
的共享公共密钥的行(在本示例中为"B"和"D").
在较新的熊猫版本(v0.21左右)中, 在哪里 同样,对于除权利外的JOIN, 最后,如果要求您进行合并,而该合并仅保留左侧或右侧的键,而不同时保留两者(IOW,执行 ANTI-JOIN ), 您可以按照类似的方式进行操作-
如果键列的名称不同(例如,
在 将此与之前的命令输出(这是 例如,考虑 如果只需要合并"new_val"(不包含任何其他列),则通常可以在合并之前仅将其子集作为子集: 如果您要进行LEFT OUTER JOIN,则更高性能的解决方案将涉及 如前所述,这类似于但快于
要在一个以上的列上联接,请为 或者,如果名称不同,
除了 本节仅介绍最基本的内容,目的只是为了激发您的胃口.有关更多示例和案例,请参见关于 设置 通常,索引合并看起来像这样:
如果您的索引已命名,则v0.23用户还可以将级别名称指定为
可以(很简单)使用一个索引和另一个列进行合并.例如, 反之亦然( 在这种特殊情况下, 请注意,我需要指定 由于列名相同.如果它们的名称不同,这将不是问题. 如果需要FULL OUTER JOIN(默认),请省略 有关更多信息,请参见
如果键(此处的键可以是列或索引)是唯一的,则可以使用 忽略 在这种情况下,我们可以使用 ... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity. This QnA is meant to be the next installment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later). Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there. This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it. In particular, here's what this post will go through: The basics - types of joins (LEFT, RIGHT, OUTER, INNER) What this post will not go through: Note Furthermore, all the DataFrames here can be copied and replicated so
you can play with them. Also, see this
post
on how to read DataFrames from your clipboard. Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Setup For the sake of simplicity, the key column has the same name (for now). An INNER JOIN is represented by Note To perform an INNER JOIN, call This returns only rows from In more recent versions of pandas (v0.21 or so), A LEFT OUTER JOIN, or LEFT JOIN is represented by This can be performed by specifying Carefully note the placement of NaNs here. If you specify And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is... ...specify Here, keys from Finally, for the FULL OUTER JOIN, given by specify This uses the keys from both frames, and NaNs are inserted for missing rows in both. The documentation summarises these various merges nicely: If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps. For LEFT-Excluding JOIN, represented as Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from Where, And similarly, for a RIGHT-Excluding JOIN, Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN), You can do this in similar fashion—
If the key columns are named differently—for example,
When merging on Contrast this with the output of the command just before (thst is, the output of For example, consider If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging: If you're doing a LEFT OUTER JOIN, a more performant solution would involve As mentioned, this is similar to, but faster than
To join on more than one column, specify a list for Or, in the event the names are different,
Besides This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on Setup Typically, a merge on index would look like this:
If your index is named, then v0.23 users can also specify the level name to
It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example, Or vice versa ( In this special case, the index for Note that I needed to specify the Since the column names are the same. This would not be a problem if they were differently named. Omit For more information, see this canonical post on Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames. Here I introduce
If your keys (here, the key could either be a column or an index) are unique, then you can use Omit In this situation, we can use
这篇关于 pandas 合并101的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!merge
现在是一阶函数,因此您可以调用left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
关键列的不同名称
left
具有keyLeft
,而right
具有keyRight
而不是key
),则必须将left_on
和right_on
指定为参数,而不是on
:left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
在输出中避免重复的键列
left
和right
上的keyLeft
和keyRight
上合并时,如果只想在输出中使用keyLeft
或keyRight
中的一个(但不能同时使用),则可以通过设置索引作为初步步骤.left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
的输出)进行对比,您会发现keyLeft
丢失了.您可以根据将哪个帧的索引设置为关键来确定要保留的列.例如,当执行某些OUTER JOIN操作时,这可能很重要. 仅合并一个
中的单个列
DataFrames
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
map
:# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
合并在多列上
on
(或相应的left_on
和right_on
)指定一个列表.left.merge(right, on=['key1', 'key2'] ...)
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
其他有用的
merge*
操作和功能
merge
之外, DataFrame.combine_first
在某些情况下还用于与另一个DataFrame进行更新.pd.merge_ordered
对于有序的JOIN是有用的功能.pd.merge_asof
(已阅读:merge_asOf)对于近似连接很有用.merge
,join
和concat
以及功能说明的链接.
基于索引的* -JOIN(+索引列
merge
s)np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
对索引名称的支持
on
(或根据需要指定left_on
和right_on
).left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
合并一个索引,另一个索引列
left.merge(right, left_on='key1', right_index=True)
right_on=...
和left_index=True
). right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
left
的索引已命名,因此您也可以将索引名与left_on
一起使用,如下所示:left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
DataFrame.join
除了这些,还有另一个简洁的选择.您可以使用DataFrame.join
,该默认默认为在索引上进行联接. DataFrame.join
默认情况下不做左外部联接,因此在这里how='inner'
是必需的.left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
lsuffix
和rsuffix
参数,因为join
否则会出错:left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat
强>
最后,作为基于索引的联接的替代方法,您可以使用pd.concat
:pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
join='inner'
:pd.concat([left, right], axis=1, sort=False)
value value
A -0.602923 NaN
B -0.402655 0.543843
C 0.302329 NaN
D -0.524349 0.013135
E NaN -0.326498
F NaN 1.385076
对唯一键(或索引)进行多路合并
pd.concat
.请注意, pd.concat
在索引上联接DataFrames .# merge on `key` column, you'll need to set the index before concatenating
pd.concat([
df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# merge on `key` index
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
join='inner'
进行完全外部联接.请注意,您不能指定LEFT或RIGHT OUTER联接(如果需要这些联接,请使用join
,如下所述). 对具有重复项的键进行多路合并
concat
速度很快,但也有缺点.它无法处理重复项.A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
join
,因为它可以处理非唯一键(请注意,join
在其索引上联接DataFrames;除非另有说明,否则它将在内部调用merge
并执行LEFT OUTER JOIN ).# join on `key` column, set as the index first
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join(
[df.set_index('key') for df in (B, C)], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# join on `key` index
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0
LEFT
|RIGHT
|FULL
) (INNER
|OUTER
) join with pandas?merge
? join
? concat
? update
? Who? What? Why?!
merge
and join
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified. Enough Talk, just show me how to use
merge
!np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
This, along with the forthcoming figures all follow this convention:
pd.merge
specifying the left DataFrame, the right DataFrame, and the join key. pd.merge(left, right, on='key')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
left
and right
which share a common key (in this example, "B" and "D). merge
is now a first order function, so you can call DataFrame.merge
. left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
how='left'
. left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
how='left'
, then only keys from left
are used, and missing data from right
is replaced by NaN.how='right'
:left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
right
are used, and missing data from left
is replaced by NaN.how='outer'
.left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
left
only,(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
Different names for key columns
left
has keyLeft
, and right
has keyRight
instead of key
—then you will have to specify left_on
and right_on
as arguments instead of on
:left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
Avoiding duplicate key column in output
keyLeft
from left
and keyRight
from right
, if you only want either of the keyLeft
or keyRight
(but not both) in the output, you can start by setting the index as a preliminary step.left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
), you'll notice keyLeft
is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation. Merging only a single column from one of the
DataFrames
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
map
:# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
Merging on multiple columns
on
(or left_on
and right_on
, as appropriate).left.merge(right, on=['key1', 'key2'] ...)
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
Other useful
merge*
operations and functions
merge
, DataFrame.update
and DataFrame.combine_first
are also used in certain cases to update one DataFrame with another.pd.merge_ordered
is a useful function for ordered JOINs.pd.merge_asof
(read: merge_asOf) is useful for approximate joins.merge
, join
, and concat
as well as the links to the function specs.
Index-based *-JOIN (+ index-column
merge
s)np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Support for index names
on
(or left_on
and right_on
as necessary).left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Merging on index of one, column(s) of another
left.merge(right, left_on='key1', right_index=True)
right_on=...
and left_index=True
). right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
left
is named, so you can also use the index name with left_on
, like this:left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join
which defaults to joins on the index. DataFrame.join
does a LEFT OUTER JOIN by default, so how='inner'
is necessary here.left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
lsuffix
and rsuffix
arguments since join
would otherwise error out:left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat
:pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
join='inner'
if you need a FULL OUTER JOIN (the default):pd.concat([left, right], axis=1, sort=False)
value value
A -0.602923 NaN
B -0.402655 0.543843
C 0.302329 NaN
D -0.524349 0.013135
E NaN -0.326498
F NaN 1.385076
pd.concat
by @piRSquared.
Generalizing:
merge
ing multiple DataFramesmerge
calls:df1.merge(df2, ...).merge(df3, ...)
pd.concat
for multi-way joins on unique keys, and DataFrame.join
for multi-way joins on non-unique keys. First, the setup.# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
Multiway merge on unique keys (or index)
pd.concat
. Note that pd.concat
joins DataFrames on the index.# merge on `key` column, you'll need to set the index before concatenating
pd.concat([
df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# merge on `key` index
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
join='inner'
for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join
, described below).Multiway merge on keys with duplicates
concat
is fast, but has its shortcomings. It cannot handle duplicates.A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
join
since it can handle non-unique keys (note that join
joins DataFrames on their index; it calls merge
under the hood and does a LEFT OUTER JOIN unless otherwise specified).# join on `key` column, set as the index first
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join(
[df.set_index('key') for df in (B, C)], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# join on `key` index
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0