Pandas Merging 101


Question


  • How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas?
  • How do I add NaNs for missing rows after a merge?
  • How do I get rid of NaNs after merging?
  • Can I merge on the index?
  • How do I merge multiple DataFrames?
  • Cross join with pandas
  • merge? join? concat? update? Who? What? Why?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).

Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.



Solution

This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

  • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

    • merging with different column names
    • merging with multiple columns
    • avoiding duplicate merge key column in output

What this post (and other posts by me on this thread) will not go through:

  • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note: Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Lastly, all visual representations of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.



Enough talk - just show me how to use merge!

Setup & Basics

import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by

Note: This figure, along with the forthcoming ones, follows this convention:

  • blue indicates rows that are present in the merge result
  • red indicates rows that are excluded from the result (i.e., removed)
  • green indicates missing values that are replaced with NaNs in the result

To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

This returns only the rows from left and right that share a common key (in this example, "B" and "D").

A LEFT OUTER JOIN, or LEFT JOIN, is represented by

This can be performed by specifying how='left'.

left.merge(right, on='key', how='left')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...

...specify how='right':

left.merge(right, on='key', how='right')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

Here, keys from right are used, and missing data from left is replaced by NaN.
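As a small sanity check on the two paragraphs above, a RIGHT JOIN can be compared with a LEFT JOIN that has the two frames swapped. This is only a sketch: the literal values below are illustrative stand-ins for the random data in the setup, and the suffixes argument is used purely to make the column names line up.

```python
import pandas as pd

# Illustrative stand-ins for the `left` and `right` frames from the setup.
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5.0, 6.0, 7.0, 8.0]})

# RIGHT JOIN: every key from `right` is kept.
right_join = left.merge(right, on='key', how='right')

# Same rows via a LEFT JOIN with the operands swapped. The suffixes are
# swapped too, so `right`'s values still land in value_y.
swapped = right.merge(left, on='key', how='left', suffixes=('_y', '_x'))
swapped = swapped[['key', 'value_x', 'value_y']]  # match the column order

print(right_join)
print(swapped)
```

Both results keep the keys of right in their original order, with NaN in value_x for the keys that left does not have.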

Finally, for the FULL OUTER JOIN, given by

specify how='outer'.

left.merge(right, on='key', how='outer')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

This uses the keys from both frames, and NaNs are inserted for missing rows in both.
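To see the four join types side by side, the sketch below runs each how option against the same two frames and records which keys survive. It reuses the setup from above; the keys_by_how name is just an illustrative variable.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

# Map each join type to the list of keys present in its result.
keys_by_how = {
    how: left.merge(right, on='key', how=how)['key'].tolist()
    for how in ['inner', 'left', 'right', 'outer']
}

for how, keys in keys_by_how.items():
    print(f'{how:>5}: {keys}')
```

The inner join keeps the intersection of the keys, the outer join keeps the union, and the left and right joins keep exactly the keys of their respective frame.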

The documentation summarizes these various merges nicely:


Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

LEFT-Excluding JOINs and RIGHT-Excluding JOINs can each be performed in two steps.

For LEFT-Excluding JOIN, represented as

Start by performing a LEFT OUTER JOIN, and then filter the result down to the rows that come from left only (excluding everything that also matched in right),

(left.merge(right, on='key', how='left', indicator=True)
     .query('_merge == "left_only"')
     .drop(columns='_merge'))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN

Where,

left.merge(right, on='key', how='left', indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

And similarly, for a RIGHT-Excluding JOIN,

(left.merge(right, on='key', how='right', indicator=True)
     .query('_merge == "right_only"')
     .drop(columns='_merge'))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

Lastly, if you need a merge that retains only the keys found in the left or the right frame, but not in both (in other words, an ANTI-JOIN), you can do this in a similar fashion:

(left.merge(right, on='key', how='outer', indicator=True)
     .query('_merge != "both"')
     .drop(columns='_merge'))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357
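The three excluding joins above follow one pattern, so they can be wrapped in a small function. The sketch below is only an illustration: excluding_join and its side argument are hypothetical names, not part of pandas, and the literal values stand in for the random data from the setup. It also shows a plain boolean filter with isin, which is a different technique that yields the left-only rows when there is a single key column.

```python
import pandas as pd

# Illustrative stand-ins for the `left` and `right` frames from the setup.
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5.0, 6.0, 7.0, 8.0]})


def excluding_join(left, right, on, side):
    """Hypothetical helper: keep rows whose key appears only on `side`.

    side is 'left_only', 'right_only', or 'either' (the full-excluding case).
    """
    how = {'left_only': 'left', 'right_only': 'right', 'either': 'outer'}[side]
    merged = left.merge(right, on=on, how=how, indicator=True)
    if side == 'either':
        mask = merged['_merge'] != 'both'
    else:
        mask = merged['_merge'] == side
    return merged[mask].drop(columns='_merge')


left_only = excluding_join(left, right, on='key', side='left_only')
right_only = excluding_join(left, right, on='key', side='right_only')
either = excluding_join(left, right, on='key', side='either')

# Alternative for a single key column: filter `left` directly, no merge needed.
# Note that this keeps only the columns of `left`.
left_only_isin = left[~left['key'].isin(right['key'])]

print(left_only, right_only, either, left_only_isin, sep='\n\n')
```

The isin variant returns just the original columns of left, whereas the merge-based variants also carry the all-NaN column from the other frame.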


Different names for key columns

If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)

left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357

left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278


Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you want only one of keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

If you contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice that keyLeft is missing. Which key column is kept depends on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
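Another way to end up with a single key column, shown here only as a sketch, is to merge on the columns as usual and drop the redundant key afterwards. The frames below use illustrative literal values in place of the random data, and the comparison is limited to the INNER JOIN case.

```python
import pandas as pd

# Illustrative stand-ins for `left2` and `right2`.
left2 = pd.DataFrame({'keyLeft': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right2 = pd.DataFrame({'keyRight': ['B', 'D', 'E', 'F'], 'value': [5.0, 6.0, 7.0, 8.0]})

# Approach from the text: use the index of one frame as the join key.
via_index = left2.set_index('keyLeft').merge(
    right2, left_index=True, right_on='keyRight')

# Alternative: merge on both columns, then drop the key that is not needed.
via_drop = (left2.merge(right2, left_on='keyLeft', right_on='keyRight')
                 .drop(columns='keyLeft'))

print(via_index)
print(via_drop)
```

For an inner join the two keys are identical in every row, so it does not matter which one is dropped. For outer joins the two columns can differ (one of them may be NaN), which is why the choice matters there.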


Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))
right3
  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[['key', 'newcol']], on='key')

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

If you're doing a LEFT OUTER JOIN, a more performant solution would involve map (this relies on the keys in right3 being unique, since the lookup Series is indexed by key):

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

left.merge(right3[['key', 'newcol']], on='key', how='left')

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0
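The equivalence of the two approaches can be checked directly. The sketch below rebuilds the frames from the setup and compares the results. The lookup name is illustrative, and no timing is attempted here. The explicit uniqueness check reflects an assumption of the map approach: each key must map to exactly one value.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
right3 = right.assign(newcol=np.arange(len(right)))

# Series indexed by key, used as a lookup table.
lookup = right3.set_index('key')['newcol']
if not lookup.index.is_unique:
    raise ValueError('map-based lookup needs unique keys in the right frame')

mapped = left.assign(newcol=left['key'].map(lookup))
merged = left.merge(right3[['key', 'newcol']], on='key', how='left')

# Compare the new column produced by each approach.
same = mapped['newcol'].equals(merged['newcol'])
print(same)
```

If the right frame had duplicate keys, a LEFT JOIN would repeat the matching left rows, which map cannot do, so the two approaches are interchangeable only under the uniqueness assumption.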


Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=['key1', 'key2'] ...)

Or, in the event the names are different,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
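Since the snippets above are schematic, here is a small runnable sketch of a two-column merge. The frames, the value column names (lval, rval), and their contents are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical frames with a composite key (key1, key2).
left_mk = pd.DataFrame({
    'key1': ['A', 'A', 'B', 'B'],
    'key2': [1, 2, 1, 2],
    'lval': [10, 20, 30, 40],
})
right_mk = pd.DataFrame({
    'key1': ['A', 'B', 'B', 'C'],
    'key2': [1, 1, 3, 1],
    'rval': [100, 200, 300, 400],
})

# A row matches only when BOTH key columns agree.
both_keys = left_mk.merge(right_mk, on=['key1', 'key2'])

# Same merge when the right frame names its keys differently.
renamed = right_mk.rename(columns={'key1': 'rkey1', 'key2': 'rkey2'})
diff_names = left_mk.merge(
    renamed, left_on=['key1', 'key2'], right_on=['rkey1', 'rkey2'])

print(both_keys)
print(diff_names)
```

Only the pairs ('A', 1) and ('B', 1) appear in both frames, so the inner join keeps two rows. With left_on and right_on, all four key columns are retained in the output.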


Other useful merge* operations and functions

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.



