Pandas: join on partial string match, like Excel VLOOKUP
Question
I am trying to perform an action in Python which is very similar to VLOOKUP in Excel. There have been many questions related to this on StackOverflow, but they are all slightly different from this use case. Hopefully someone can guide me in the right direction. I have the following two pandas dataframes:
import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})
print(df1)
  Invoice Currency
0   20561      EUR
1   20562      EUR
2   20563      EUR
3   20564      USD
print(df2)
          Ref Type Amount Comment
0       20561   01    150     bla
1    INV20562   03    175     bla
2  INV20563BG   04    160     bla
3       20564   02    180     bla
Now I would like to create a new dataframe (df3) where I combine the two based on the invoice numbers. The problem is that the invoice numbers are not always a "full match", but sometimes a "partial match" in df2['Ref']. So joining on 'Invoice' does not give the desired output, because it does not copy the data for invoices 20562 and 20563; see below:
df3 = df1.join(df2.set_index('Ref'), on='Invoice')
print(df3)
  Invoice Currency Type Amount Comment
0   20561      EUR   01    150     bla
1   20562      EUR  NaN    NaN     NaN
2   20563      EUR  NaN    NaN     NaN
3   20564      USD   02    180     bla
Is there a way to join on a partial match? I know how to "clean" df2['Ref'] with regex, but that is not the solution I am after. A for loop gets me most of the way, but it isn't very Pythonic.
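df4 = df1.copy()
for i, row in df1.iterrows():
    # Rows of df2 whose 'Ref' contains this invoice number
    tmp = df2[df2['Ref'].str.contains(row['Invoice'])]
    # Take the first match (raises IndexError if nothing matches)
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0]
print(df4)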
  Invoice Currency Amount
0   20561      EUR    150
1   20562      EUR    175
2   20563      EUR    160
3   20564      USD    180
Can str.contains() somehow be used in a more elegant way? Thank you so much in advance for your help!
Recommended answer
This is one way, using pd.Series.apply, which is just a thinly veiled loop. A "partial string merge" is what you are looking for; I'm not sure it exists in a vectorised form.
df4 = df1.copy()
def get_amount(x):
    # First 'Amount' whose 'Ref' contains x (raises IndexError if nothing matches)
    return df2.loc[df2['Ref'].str.contains(x), 'Amount'].iloc[0]
df4['Amount'] = df4['Invoice'].apply(get_amount)
print(df4)
  Invoice Currency Amount
0   20561      EUR    150
1   20562      EUR    175
2   20563      EUR    160
3   20564      USD    180
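If a loop-free version is wanted, one possible sketch (not part of the original answer) is to build a single regex from the known invoice numbers, pull the matching number out of each 'Ref' with str.extract, and then do an ordinary left merge. The names pattern and keyed are made up for this illustration. It assumes that each 'Ref' contains at most one invoice number and that no invoice number is a substring of another; if either assumption fails, the first alternative that matches wins and the result may be wrong.

```python
import re

import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

# One capture group listing every known invoice number as an alternative.
# re.escape guards against regex metacharacters in the invoice strings.
pattern = '(' + '|'.join(re.escape(inv) for inv in df1['Invoice']) + ')'

# Helper column (hypothetical name 'Invoice') holding the invoice number
# found inside each 'Ref'; rows without any match get NaN.
keyed = df2.assign(Invoice=df2['Ref'].str.extract(pattern, expand=False))

# With an exact key available, a regular left merge does the lookup.
df3 = df1.merge(keyed, on='Invoice', how='left')
print(df3[['Invoice', 'Currency', 'Amount']])
```

Unlike the apply version, invoices with no match simply end up with NaN in the merged columns instead of raising IndexError.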