如何在字符串包含上合并大 pandas ? [英] How to merge pandas on string contains?

查看:94
本文介绍了如何在字符串包含上合并大 pandas ?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有2个数据框,我想合并到一个公共列上.但是,我要合并的列不是同一字符串,而是另一个中包含一个字符串,如下所示:

I have 2 dataframes that I would like to merge on a common column. However the column I would like to merge on are not of the same string, but rather a string from one is contained in the other as so:

import pandas as pd
df1 = pd.DataFrame({'column_a':['John','Michael','Dan','George', 'Adam'], 'column_common':['code','other','ome','no match','word']})

df2 = pd.DataFrame({'column_b':['Smith','Cohen','Moore','K', 'Faber'], 'column_common':['some string','other string','some code','this code','word']})

我希望从d1.merge(d2, ...)获得的结果如下:

The outcome I would like from d1.merge(d2, ...) is the following:

column_a  |  column_b
----------------------
John      |  Moore    <- merged on 'code' contained in 'some code' 
Michael   |  Cohen    <- merged on 'other' contained in 'other string'  
Dan       |  Smith    <- merged on 'ome' contained in 'some string'  
George    |  n/a
Adam      |  Faber    <- merged on 'word' contained in 'word'  

推荐答案

新答案

这是一种基于pandas/numpy的方法.

New Answer

Here is one approach based on pandas/numpy.

rhs = (df1.column_common
          .apply(lambda x: df2[df2.column_common.str.find(x).ge(0)]['column_b'])
          .bfill(axis=1)
          .iloc[:, 0])

(pd.concat([df1.column_a, rhs], axis=1, ignore_index=True)
 .rename(columns={0: 'column_a', 1: 'column_b'}))

  column_a column_b
0     John    Moore
1  Michael    Cohen
2      Dan    Smith
3   George      NaN
4     Adam    Faber

旧答案

这是左联接行为的一种解决方案,因为它不会保留与任何column_b值都不匹配的column_a值.这比上面的numpy/pandas解决方案要慢,因为它使用两个嵌套的iterrows循环来构建python列表.

Old Answer

Here's a solution for left-join behaviour, as in it doesn't keep column_a values that do not match any column_b values. This is slower than the above numpy/pandas solution because it uses two nested iterrows loops to build a python list.

tups = [(a1, a2) for i, (a1, b1) in df1.iterrows() 
                 for j, (a2, b2) in df2.iterrows()
        if b1 in b2]

(pd.DataFrame(tups, columns=['column_a', 'column_b'])
   .drop_duplicates('column_a')
   .reset_index(drop=True))

  column_a column_b
0     John    Moore
1  Michael    Cohen
2      Dan    Smith
3     Adam    Faber

这篇关于如何在字符串包含上合并大 pandas ?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆