Pandas: join on partial string match, like Excel VLOOKUP

Problem description

I am trying to perform an action in Python which is very similar to VLOOKUP in Excel. There have been many questions related to this on StackOverflow, but they are all slightly different from this use case. Hopefully someone can guide me in the right direction. I have the following two pandas dataframes:

import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

print(df1)
    Invoice Currency
0   20561   EUR
1   20562   EUR
2   20563   EUR
3   20564   USD

print(df2)
    Ref         Type    Amount  Comment
0   20561       01      150     bla
1   INV20562    03      175     bla
2   INV20563BG  04      160     bla
3   20564       02      180     bla

Now I would like to create a new dataframe (df3) where I combine the two based on the invoice numbers. The problem is that the invoice numbers are not always a "full match", but sometimes a "partial match" in df2['Ref']. So joining on 'Invoice' does not give the desired output, because it doesn't copy the data for invoices 20562 and 20563, see below:

df3 = df1.join(df2.set_index('Ref'), on='Invoice')

print(df3)
    Invoice Currency    Type    Amount  Comment
0   20561   EUR         01       150    bla
1   20562   EUR         NaN      NaN    NaN
2   20563   EUR         NaN      NaN    NaN
3   20564   USD         02       180    bla

Is there a way to join on a partial match? I know how to "clean" df2['Ref'] with regex, but that is not the solution I am after. With a for loop I get a long way, but this isn't very Pythonic.

df4 = df1.copy()
for i, row in df1.iterrows():
    tmp = df2[df2['Ref'].str.contains(row['Invoice'])]
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0]

print(df4)
    Invoice Currency    Amount
0   20561   EUR         150
1   20562   EUR         175
2   20563   EUR         160
3   20564   USD         180

Can str.contains() somehow be used in a more elegant way? Thank you so much in advance for your help!

Recommended answer

This is one way using pd.Series.apply, which is just a thinly veiled loop. A "partial string merge" is what you are looking for, but I'm not sure it exists in a vectorised form.

df4 = df1.copy()

def get_amount(x):
    return df2.loc[df2['Ref'].str.contains(x), 'Amount'].iloc[0]

df4['Amount'] = df4['Invoice'].apply(get_amount)

print(df4)

  Currency Invoice Amount
0      EUR   20561    150
1      EUR   20562    175
2      EUR   20563    160
3      USD   20564    180
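
If all columns of df2 are wanted rather than just 'Amount', one possible alternative (not part of the original answer) is a cross join followed by a substring filter. The sketch below assumes pandas 1.2 or later for how='cross', plain substring matching instead of regex, and data small enough that len(df1) * len(df2) rows fit in memory. The names pairs, mask and matched are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

# Pair every invoice with every reference (assumes pandas >= 1.2).
pairs = df1.merge(df2, how='cross')

# Keep only the pairs where the invoice number occurs inside the reference.
# This is a plain substring test, so no regex escaping is needed.
mask = [inv in ref for inv, ref in zip(pairs['Invoice'], pairs['Ref'])]
matched = pairs[mask]

# Left-merge back onto df1 so invoices without any match are kept with NaN.
df3 = df1.merge(matched.drop(columns='Currency'), on='Invoice', how='left')

print(df3)
```

The filter itself is still a Python-level loop over the paired rows, so this is not truly vectorised either. Unlike the .iloc[0] lookup, though, it does not raise when an invoice has no matching reference, and it returns one row per match if an invoice appears in several references.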
