pandas 合并101 [英] Pandas Merging 101

查看：101 发布时间：2019/9/19 15:34:59 python pandas join merge

本文介绍了 pandas 合并101的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

如何执行与熊猫的(LEFT | RIGHT | FULL)(INNER | OUTER)连接?
如何在合并后为缺失的行添加NaN?
合并后如何去除NaN?
我可以合并索引吗?
与熊猫交叉连接吗?

如何合并多个DataFrame?

merge? join? concat? update?谁?什么?为什么?！

...等等.我已经看到这些重复出现的问题，询问有关熊猫合并功能的各个方面.如今，有关合并及其各种用例的大多数信息都分散在数十个措辞不好，无法搜索的帖子中.这里的目的是整理一些关于后代的更重要的观点.

此QnA旨在作为有关常见熊猫习语的一系列有用的用户指南的下一部分(请参阅这篇有关串联的帖子，我将在稍后进行讨论).

请注意，该帖子不是，它是文档，因此也请阅读！一些示例是从那里获取的.

解决方案

这篇文章旨在为读者提供有关SQL风格与大熊猫的合并，使用方法以及何时不使用它的入门知识.

特别是，这是这篇文章的内容:

基本知识-联接类型(左，右，外，内)
- 合并不同的列名
- 避免在输出中出现重复的合并键列
在不同条件下合并索引
- 有效使用您的命名索引
- 将键合并为一个键的索引，并将另一个键作为列的索引
Multiway合并列和索引(唯一和非唯一)
merge和join

该帖子不会涉及的内容:

与性能有关的讨论和时间安排(目前).在适当的地方，最引人注目的是提到更好的替代方案.
处理后缀，删除多余的列，重命名输出以及其他特定用例.还有其他(阅读:更好)的文章可以解决这个问题，所以请弄清楚！

注意
除非另有说明，否则大多数示例在演示各种功能时会默认使用INNER JOIN操作.

此外，此处的所有DataFrame都可以复制和复制，因此你可以和他们一起玩.另请参阅此发布关于如何从剪贴板读取DataFrame的信息.

最后，所有JOIN操作的视觉表示都是使用Google绘图手工绘制的.来自此处的启发.

足够多的谈话，请教我如何使用`merge`！

设置

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})    
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

为简单起见，key列具有相同的名称(目前).

内部联接由

表示

注意
这以及即将出现的数字均遵循以下惯例:

蓝色表示合并结果中存在的行

红色表示从结果中排除(即已删除)的行

绿色表示结果中缺失的值被NaN取代

要执行内部联接，请调用 pd.merge 指定左侧的DataFrame，右侧的DataFrame和连接键.

pd.merge(left, right, on='key')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

这仅返回来自left和right的共享公共密钥的行(在本示例中为"B"和"D").

在较新的熊猫版本(v0.21左右)中，merge现在是一阶函数，因此您可以调用

在哪里

left.merge(right, on='key', how='left', indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

同样，对于除权利外的JOIN，

(left.merge(right, on='key', how='right', indicator=True)
     .query('_merge == "right_only"')
     .drop('_merge', 1))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

最后，如果要求您进行合并，而该合并仅保留左侧或右侧的键，而不同时保留两者(IOW，执行 ANTI-JOIN )，

您可以按照类似的方式进行操作-

(left.merge(right, on='key', how='outer', indicator=True)
     .query('_merge != "both"')
     .drop('_merge', 1))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357

关键列的不同名称

如果键列的名称不同(例如，left具有keyLeft，而right具有keyRight而不是key)，则必须将left_on和right_on指定为参数，而不是on:

left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)

left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357

left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

在输出中避免重复的键列

在left和right上的keyLeft和keyRight上合并时，如果只想在输出中使用keyLeft或keyRight中的一个(但不能同时使用)，则可以通过设置索引作为初步步骤.

left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

将此与之前的命令输出(这是left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')的输出)进行对比，您会发现keyLeft丢失了.您可以根据将哪个帧的索引设置为关键来确定要保留的列.例如，当执行某些OUTER JOIN操作时，这可能很重要.

仅合并一个`DataFrames`

中的单个列

例如，考虑

right3 = right.assign(newcol=np.arange(len(right)))
right3
  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

如果只需要合并"new_val"(不包含任何其他列)，则通常可以在合并之前仅将其子集作为子集:

left.merge(right3[['key', 'newcol']], on='key')

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

如果您要进行LEFT OUTER JOIN，则更高性能的解决方案将涉及map:

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

如前所述，这类似于但快于

left.merge(right3[['key', 'newcol']], on='key', how='left')

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

合并在多列上

要在一个以上的列上联接，请为on(或相应的left_on和right_on)指定一个列表.

left.merge(right, on=['key1', 'key2'] ...)

或者，如果名称不同，

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])

**其他有用的`merge*`操作和功能**

合并具有序列索引的DataFrame :请参见此答案.
除了merge之外，和 DataFrame.combine_first 在某些情况下还用于与另一个DataFrame进行更新.
pd.merge_ordered 对于有序的JOIN是有用的功能.
pd.merge_asof (已阅读:merge_asOf)对于近似连接很有用.

本节仅介绍最基本的内容，目的只是为了激发您的胃口.有关更多示例和案例，请参见关于merge，join和concat 以及功能说明的链接.

基于索引的* -JOIN(+索引列`merge` s)

设置

np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])    
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right

           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076

通常，索引合并看起来像这样:

left.merge(right, left_index=True, right_index=True)


         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

对索引名称的支持

如果您的索引已命名，则v0.23用户还可以将级别名称指定为on(或根据需要指定left_on和right_on).

left.merge(right, on='idxkey')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

合并一个索引，另一个索引列

可以(很简单)使用一个索引和另一个列进行合并.例如，

left.merge(right, left_on='key1', right_index=True)

反之亦然(right_on=...和left_index=True).

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2

  colkey     value
0      B  0.543843
1      D  0.013135
2      E -0.326498
3      F  1.385076

left.merge(right2, left_index=True, right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

在这种特殊情况下，left的索引已命名，因此您也可以将索引名与left_on一起使用，如下所示:

left.merge(right2, left_on='idxkey', right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

DataFrame.join
除了这些，还有另一个简洁的选择.您可以使用DataFrame.join，该默认默认为在索引上进行联接. DataFrame.join默认情况下不做左外部联接，因此在这里how='inner'是必需的.

left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

请注意，我需要指定lsuffix和rsuffix参数，因为join否则会出错:

left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

由于列名相同.如果它们的名称不同，这将不是问题.

left.rename(columns={'value':'leftvalue'}).join(right, how='inner')

        leftvalue     value
idxkey                     
B       -0.402655  0.543843
D       -0.524349  0.013135

pd.concat
最后，作为基于索引的联接的替代方法，您可以使用pd.concat:

pd.concat([left, right], axis=1, sort=False, join='inner')

           value     value
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

如果需要FULL OUTER JOIN(默认)，请省略join='inner':

pd.concat([left, right], axis=1, sort=False)

      value     value
A -0.602923       NaN
B -0.402655  0.543843
C  0.302329       NaN
D -0.524349  0.013135
E       NaN -0.326498
F       NaN  1.385076

有关更多信息，请参见

如果键(此处的键可以是列或索引)是唯一的，则可以使用pd.concat.请注意， pd.concat在索引上联接DataFrames .
# merge on `key` column, you'll need to set the index before concatenating pd.concat([ df.set_index('key') for df in dfs], axis=1, join='inner' ).reset_index() key valueA valueB valueC 0 D 2.240893 -0.977278 1.0 # merge on `key` index pd.concat(dfs2, axis=1, sort=False, join='inner') valueA valueB valueC key D 2.240893 -0.977278 1.0
忽略join='inner'进行完全外部联接.请注意，您不能指定LEFT或RIGHT OUTER联接(如果需要这些联接，请使用join，如下所述).

对具有重复项的键进行多路合并

concat速度很快，但也有缺点.它无法处理重复项.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})

pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner') ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
在这种情况下，我们可以使用join，因为它可以处理非唯一键(请注意，join在其索引上联接DataFrames；除非另有说明，否则它将在内部调用merge并执行LEFT OUTER JOIN ).
# join on `key` column, set as the index first # For inner join. For left join, omit the "how" argument. A.set_index('key').join( [df.set_index('key') for df in (B, C)], how='inner').reset_index() key valueA valueB valueC 0 D 2.240893 -0.977278 1.0 # join on `key` index A3.set_index('key').join([B2, C2], how='inner') valueA valueB valueC key D 1.454274 -0.977278 1.0 D 0.761038 -0.977278 1.0

How to perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?

How do I add NaNs for missing rows after merge?

How do I get rid of NaNs after merging?

Can I merge on the index?

Cross join with pandas?

How do I merge multiple DataFrames?

merge? join? concat? update? Who? What? Why?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This QnA is meant to be the next installment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).

Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
解决方案
This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

merging with different column names

avoiding duplicate merge key column in output

Merging with index under different conditions

effectively using your named index

merge key as the index of one and column of another

Multiway merges on columns and indexes (unique and non-unique)

Notable alternatives to merge and join

What this post will not go through:

Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.

Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.

Enough Talk, just show me how to use merge!

Setup
np.random.seed(0) left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)}) right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)}) left key value 0 A 1.764052 1 B 0.400157 2 C 0.978738 3 D 2.240893 right key value 0 B 1.867558 1 D -0.977278 2 E 0.950088 3 F -0.151357
For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by

Note
This, along with the forthcoming figures all follow this convention:

blue indicates rows that are present in the merge result

red indicates rows that are excluded from the result (i.e., removed)

green indicates missing values that are replaced with NaNs in the result

To perform an INNER JOIN, call pd.merge specifying the left DataFrame, the right DataFrame, and the join key.
pd.merge(left, right, on='key') key value_x value_y 0 B 0.400157 1.867558 1 D 2.240893 -0.977278
This returns only rows from left and right which share a common key (in this example, "B" and "D).

In more recent versions of pandas (v0.21 or so), merge is now a first order function, so you can call DataFrame.merge.
left.merge(right, on='key') # Or, if you want to be explicit # left.merge(right, on='key', how='inner') key value_x value_y 0 B 0.400157 1.867558 1 D 2.240893 -0.977278
A LEFT OUTER JOIN, or LEFT JOIN is represented by

This can be performed by specifying how='left'.
left.merge(right, on='key', how='left') key value_x value_y 0 A 1.764052 NaN 1 B 0.400157 1.867558 2 C 0.978738 NaN 3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...

...specify how='right':
left.merge(right, on='key', how='right') key value_x value_y 0 B 0.400157 1.867558 1 D 2.240893 -0.977278 2 E NaN 0.950088 3 F NaN -0.151357
Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, for the FULL OUTER JOIN, given by

specify how='outer'.
left.merge(right, on='key', how='outer') key value_x value_y 0 A 1.764052 NaN 1 B 0.400157 1.867558 2 C 0.978738 NaN 3 D 2.240893 -0.977278 4 E NaN 0.950088 5 F NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarises these various merges nicely:

Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.

For LEFT-Excluding JOIN, represented as

Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from left only,
(left.merge(right, on='key', how='left', indicator=True) .query('_merge == "left_only"') .drop('_merge', 1)) key value_x value_y 0 A 1.764052 NaN 2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True) key value_x value_y _merge 0 A 1.764052 NaN left_only 1 B 0.400157 1.867558 both 2 C 0.978738 NaN left_only 3 D 2.240893 -0.977278 both
And similarly, for a RIGHT-Excluding JOIN,

(left.merge(right, on='key', how='right', indicator=True) .query('_merge == "right_only"') .drop('_merge', 1)) key value_x value_y 2 E NaN 0.950088 3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),

You can do this in similar fashion—
(left.merge(right, on='key', how='outer', indicator=True) .query('_merge != "both"') .drop('_merge', 1)) key value_x value_y 0 A 1.764052 NaN 2 C 0.978738 NaN 4 E NaN 0.950088 5 F NaN -0.151357

Different names for key columns

If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:
left2 = left.rename({'key':'keyLeft'}, axis=1) right2 = right.rename({'key':'keyRight'}, axis=1) left2 keyLeft value 0 A 1.764052 1 B 0.400157 2 C 0.978738 3 D 2.240893 right2 keyRight value 0 B 1.867558 1 D -0.977278 2 E 0.950088 3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner') keyLeft value_x keyRight value_y 0 B 0.400157 B 1.867558 1 D 2.240893 D -0.977278

Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft') left3.merge(right2, left_index=True, right_on='keyRight') value_x keyRight value_y 0 0.400157 B 1.867558 1 2.240893 D -0.977278
Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.

Merging only a single column from one of the DataFrames

For example, consider
right3 = right.assign(newcol=np.arange(len(right))) right3 key value newcol 0 B 1.867558 0 1 D -0.977278 1 2 E 0.950088 2 3 F -0.151357 3
If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key') key value newcol 0 B 0.400157 0 1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])) left.assign(newcol=left['key'].map(right3.set_index('key')['newcol'])) key value newcol 0 A 1.764052 NaN 1 B 0.400157 0.0 2 C 0.978738 NaN 3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left') key value newcol 0 A 1.764052 NaN 1 B 0.400157 0.0 2 C 0.978738 NaN 3 D 2.240893 1.0

Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])

Other useful merge* operations and functions

Merging a DataFrame with Series on index: See this answer.

Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.

pd.merge_ordered is a useful function for ordered JOINs.

pd.merge_asof (read: merge_asOf) is useful for approximate joins.

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specs.

Index-based *-JOIN (+ index-column merges)

Setup
np.random.seed([3, 14]) left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D']) right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F']) left.index.name = right.index.name = 'idxkey' left value idxkey A -0.602923 B -0.402655 C 0.302329 D -0.524349 right value idxkey B 0.543843 D 0.013135 E -0.326498 F 1.385076
Typically, a merge on index would look like this:
left.merge(right, left_index=True, right_index=True) value_x value_y idxkey B -0.402655 0.543843 D -0.524349 0.013135

Support for index names

If your index is named, then v0.23 users can also specify the level name to on (or left_on and right_on as necessary).
left.merge(right, on='idxkey') value_x value_y idxkey B -0.402655 0.543843 D -0.524349 0.013135

Merging on index of one, column(s) of another

It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,
left.merge(right, left_on='key1', right_index=True)
Or vice versa (right_on=... and left_index=True).
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1) right2 colkey value 0 B 0.543843 1 D 0.013135 2 E -0.326498 3 F 1.385076 left.merge(right2, left_index=True, right_on='colkey') value_x colkey value_y 0 -0.402655 B 0.543843 1 -0.524349 D 0.013135
In this special case, the index for left is named, so you can also use the index name with left_on, like this:
left.merge(right2, left_on='idxkey', right_on='colkey') value_x colkey value_y 0 -0.402655 B 0.543843 1 -0.524349 D 0.013135
DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join which defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y') value_x value_y idxkey B -0.402655 0.543843 D -0.524349 0.013135
Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:
left.join(right) ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Since the column names are the same. This would not be a problem if they were differently named.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner') leftvalue value idxkey B -0.402655 0.543843 D -0.524349 0.013135
pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat:
pd.concat([left, right], axis=1, sort=False, join='inner') value value idxkey B -0.402655 0.543843 D -0.524349 0.013135
Omit join='inner' if you need a FULL OUTER JOIN (the default):
pd.concat([left, right], axis=1, sort=False) value value A -0.602923 NaN B -0.402655 0.543843 C 0.302329 NaN D -0.524349 0.013135 E NaN -0.326498 F NaN 1.385076
For more information, see this canonical post on pd.concat by @piRSquared.

Generalizing: mergeing multiple DataFrames

Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:
df1.merge(df2, ...).merge(df3, ...)
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.

Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.
# Setup. np.random.seed(0) A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)}) B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)}) C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)}) dfs = [A, B, C] # Note, the "key" column values are unique, so the index is unique. A2 = A.set_index('key') B2 = B.set_index('key') C2 = C.set_index('key') dfs2 = [A2, B2, C2]

Multiway merge on unique keys (or index)

If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.
# merge on `key` column, you'll need to set the index before concatenating pd.concat([ df.set_index('key') for df in dfs], axis=1, join='inner' ).reset_index() key valueA valueB valueC 0 D 2.240893 -0.977278 1.0 # merge on `key` index pd.concat(dfs2, axis=1, sort=False, join='inner') valueA valueB valueC key D 2.240893 -0.977278 1.0
Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).

Multiway merge on keys with duplicates

concat is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner') ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).
# join on `key` column, set as the index first # For inner join. For left join, omit the "how" argument. A.set_index('key').join( [df.set_index('key') for df in (B, C)], how='inner').reset_index() key valueA valueB valueC 0 D 2.240893 -0.977278 1.0 # join on `key` index A3.set_index('key').join([B2, C2], how='inner') valueA valueB valueC key D 1.454274 -0.977278 1.0 D 0.761038 -0.977278 1.0

这篇关于 pandas 合并101的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

pandas 合并101 [英] Pandas Merging 101

问题描述

足够多的谈话，请教我如何使用`merge`！

关键列的不同名称

在输出中避免重复的键列

仅合并一个`DataFrames`

合并在多列上

**其他有用的`merge*`操作和功能**

基于索引的* -JOIN(+索引列`merge` s)

对索引名称的支持

合并一个索引，另一个索引列

对唯一键(或索引)进行多路合并

对具有重复项的键进行多路合并

Enough Talk, just show me how to use `merge`!

Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

Different names for key columns

Avoiding duplicate key column in output

Merging only a single column from one of the `DataFrames`

Merging on multiple columns

**Other useful `merge*` operations and functions**

Index-based *-JOIN (+ index-column `merge`s)

Support for index names

Merging on index of one, column(s) of another

Generalizing: `merge`ing multiple DataFrames

Multiway merge on unique keys (or index)

Multiway merge on keys with duplicates

相关文章

Python最新文章

热门教程

热门工具

登录关闭

pandas 合并101 [英] Pandas Merging 101

问题描述

足够多的谈话，请教我如何使用merge！

关键列的不同名称

在输出中避免重复的键列

仅合并一个DataFrames

合并在多列上

其他有用的merge*操作和功能

基于索引的* -JOIN(+索引列merge s)

对索引名称的支持

合并一个索引，另一个索引列

对唯一键(或索引)进行多路合并

对具有重复项的键进行多路合并

Enough Talk, just show me how to use merge!

Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

Different names for key columns

Avoiding duplicate key column in output

Merging only a single column from one of the DataFrames

Merging on multiple columns

Other useful merge* operations and functions

Index-based *-JOIN (+ index-column merges)

Support for index names

Merging on index of one, column(s) of another

Generalizing: mergeing multiple DataFrames

Multiway merge on unique keys (or index)

Multiway merge on keys with duplicates

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

足够多的谈话，请教我如何使用`merge`！

仅合并一个`DataFrames`

**其他有用的`merge*`操作和功能**

基于索引的* -JOIN(+索引列`merge` s)

Enough Talk, just show me how to use `merge`!

Merging only a single column from one of the `DataFrames`

**Other useful `merge*` operations and functions**

Index-based *-JOIN (+ index-column `merge`s)

Generalizing: `merge`ing multiple DataFrames

登录关闭