Pandas: join on partial string match, like Excel VLOOKUP

Question

I am trying to perform an operation in Python that is very similar to VLOOKUP in Excel. There are many related questions on StackOverflow, but they all differ slightly from this use case. Hopefully someone can point me in the right direction. I have the following two pandas dataframes:

import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

print(df1)
    Invoice Currency
0   20561   EUR
1   20562   EUR
2   20563   EUR
3   20564   USD

print(df2)
    Ref         Type    Amount  Comment
0   20561       01      150     bla
1   INV20562    03      175     bla
2   INV20563BG  04      160     bla
3   20564       02      180     bla

Now I would like to create a new dataframe (df3) that combines the two based on the invoice number. The problem is that the invoice numbers are not always a "full match"; sometimes they are only a "partial match" within df2['Ref']. Joining on 'Invoice' therefore does not give the desired output, because it doesn't copy the data for invoices 20562 and 20563, see below:

df3 = df1.join(df2.set_index('Ref'), on='Invoice')

print(df3)
    Invoice Currency    Type    Amount  Comment
0   20561   EUR         01       150    bla
1   20562   EUR         NaN      NaN    NaN
2   20563   EUR         NaN      NaN    NaN
3   20564   USD         02       180    bla

Is there a way to join on a partial match? I know how to "clean" df2['Ref'] with a regex, but that is not the solution I am after. A for loop gets me most of the way there, but it isn't very Pythonic.

df4 = df1.copy()
for i, row in df1.iterrows():
    tmp = df2[df2['Ref'].str.contains(row['Invoice'])]
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0]

print(df4)
    Invoice Currency    Amount
0   20561   EUR         150
1   20562   EUR         175
2   20563   EUR         160
3   20564   USD         180

Can str.contains() somehow be used in a more elegant way? Thank you so much in advance for your help!

Recommended answer

Here is one way using pd.Series.apply, which is just a thinly veiled loop. What you are looking for is a "partial string merge", and I'm not sure one exists in vectorised form.

df4 = df1.copy()

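def get_amount(x):
    # First Amount whose Ref contains the invoice number as a literal substring
    # (regex=False); raises IndexError if no Ref matches.
    return df2.loc[df2['Ref'].str.contains(x, regex=False), 'Amount'].iloc[0]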

df4['Amount'] = df4['Invoice'].apply(get_amount)

print(df4)

  Currency Invoice Amount
0      EUR   20561    150
1      EUR   20562    175
2      EUR   20563    160
3      USD   20564    180
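
If you want all of df2's columns rather than just Amount, one possible alternative (not part of the original answer) is to build every Invoice/Ref pair with a cross join and keep only the pairs where the invoice number occurs inside the reference. This is a rough sketch: it assumes pandas 1.2 or later for how='cross', and the names pairs and mask are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

# Pair every invoice with every reference: len(df1) * len(df2) rows.
# Assumption: pandas >= 1.2, where merge accepts how='cross'.
pairs = df1.merge(df2, how='cross')

# Keep only the pairs where the invoice number is a substring of Ref.
mask = [inv in ref for inv, ref in zip(pairs['Invoice'], pairs['Ref'])]
df3 = pairs[mask].reset_index(drop=True)

print(df3)
```

The substring test is still a Python-level loop, so this is not truly vectorised either, and the intermediate frame grows with the product of the two row counts, which is fine for small lookups but can get expensive for large ones. Unlike the apply version, an invoice with no matching Ref simply drops out of df3 instead of raising, and an invoice that matches several Refs yields several rows.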
