Pandas: how to merge 2 dataframes on key1.str.endswith(key2)
Question
I want to find the best way to merge 2 dataframes on key1.str.endswith(key2). An example is sometimes better than words:
I want to merge df1 and df2 on product.str.endswith(color).
df1:
index product
1 a208-BLACK
2 a2008-WHITE
3 x307-PEARL-WHITE
4 aa-b307-WHITE
df2:
index color code
1 BLACK X1001
2 WHITE X7005
3 PEARL-WHITE X7055
to obtain:
df:
index product code
1 a208-BLACK X1001
2 a2008-WHITE X7005
3 x307-PEARL-WHITE X7055
4 aa-b307-WHITE X7005
Any ideas?
Recommended answer
I'm not a regex expert, and the last row was the trickiest one to handle, but the following works (in this session df holds the products and df1 holds the colors):
In [402]:
df['code'] = df['product'].str.split('-').str[1:].str.join('-').str.findall(r'[A-Z]+').str.join('-').map(df1.set_index('color')['code'])
df
Out[402]:
product code
index
1 a208-BLACK X1001
2 a2008-WHITE X7005
3 x307-PEARL-WHITE X7055
4 aa-b307-WHITE X7005
Basically I split the product code on - and take all the elements to the right of the first dash.
This leaves:
In [403]:
df['product'].str.split('-').str[1:]
Out[403]:
index
1 [BLACK]
2 [WHITE]
3 [PEARL, WHITE]
4 [b307, WHITE]
Name: product, dtype: object
I then put the dash back and use a regex to find only the runs of uppercase letters, which deals with the last row (it drops b307), then rejoin with a dash again.
The last bit is to call map on the result, passing the other df with its index set to the color column. This looks up each color value and returns the corresponding code.
The regex isn't foolproof, but it works for your dataset.
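For anyone who wants to run the steps above outside the session, here is a minimal self-contained sketch. The frames are rebuilt by hand from the question, and the intermediate names (parts, tail, color, lookup) are mine, not from the original session.

```python
import pandas as pd

# Frames rebuilt from the question. As in the answer's session, `df` holds
# the products and `df1` holds the color-to-code lookup.
df = pd.DataFrame(
    {"product": ["a208-BLACK", "a2008-WHITE", "x307-PEARL-WHITE", "aa-b307-WHITE"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)
df1 = pd.DataFrame(
    {"color": ["BLACK", "WHITE", "PEARL-WHITE"],
     "code": ["X1001", "X7005", "X7055"]},
    index=pd.Index([1, 2, 3], name="index"),
)

# Step 1: split on '-' and drop everything before the first dash.
parts = df["product"].str.split("-").str[1:]
# Step 2: glue the remaining pieces back together with '-'.
tail = parts.str.join("-")
# Step 3: keep only runs of uppercase letters, then rejoin them with '-'.
color = tail.str.findall(r"[A-Z]+").str.join("-")
# Step 4: look each color up in a Series indexed by color.
lookup = df1.set_index("color")["code"]
df["code"] = color.map(lookup)

print(df)
```

Splitting the one-liner this way makes it easy to print each intermediate Series and see which step handles which row.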
Edit
I now realise we don't need that many joins:
In [409]:
df['code'] = df['product'].str.findall(r'[A-Z]+').str.join('-').map(df1.set_index('color')['code'])
df
Out[409]:
product code
index
1 a208-BLACK X1001
2 a2008-WHITE X7005
3 x307-PEARL-WHITE X7055
4 aa-b307-WHITE X7005
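The shorter expression relies on the color being the only uppercase text in the product string. A small sketch of where that assumption breaks: the second product below is hypothetical (it is not in the question's data) and has an uppercase letter before the dash, so the rebuilt key no longer matches any color and the lookup comes back as NaN.

```python
import pandas as pd

# Color-to-code lookup taken from the question's second frame.
lookup = pd.Series({"BLACK": "X1001", "WHITE": "X7005", "PEARL-WHITE": "X7055"})

# The second product is hypothetical: it is not in the original data and
# has an uppercase letter in the part before the dash.
products = pd.Series(["a208-BLACK", "A208-BLACK"])

# Same expression as in the edit: collect uppercase runs and join with '-'.
color = products.str.findall(r"[A-Z]+").str.join("-")
codes = color.map(lookup)

print(color.tolist())
print(codes.tolist())
```

If your real product codes can contain uppercase letters, anchoring the pattern to the end of the string is likely a safer choice than collecting every uppercase run.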
Timings
In [414]:
%%timeit
import re
df['color'] = df['product'].apply(lambda x: re.sub('^[^ALPHA:]*-(.*)', '\\1', x))
pd.merge(df, df1, on='color')
1 loops, best of 3: 4.09 ms per loop
In [416]:
%%timeit
df['code'] = df['product'].str.findall(r'[A-Z]+').str.join('-').map(df1.set_index('color')['code'])
100 loops, best of 3: 1.63 ms per loop
The str method is over 2X faster than using the lambda. This may not be so surprising, as the str methods are vectorised, as is calling map.
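The %%timeit cells only run under IPython. A rough equivalent with the standard-library timeit module is sketched below. The function names with_map and with_lambda_merge are mine, the frames are rebuilt from the question, and the numbers you get will depend on your machine and pandas version, so don't expect them to match the figures quoted here.

```python
import re
import timeit

import pandas as pd

# Frames rebuilt from the question: `df` has the products, `df1` the colors.
df = pd.DataFrame(
    {"product": ["a208-BLACK", "a2008-WHITE", "x307-PEARL-WHITE", "aa-b307-WHITE"]}
)
df1 = pd.DataFrame(
    {"color": ["BLACK", "WHITE", "PEARL-WHITE"],
     "code": ["X1001", "X7005", "X7055"]}
)


def with_map():
    # str methods followed by a lookup with map.
    lookup = df1.set_index("color")["code"]
    return df["product"].str.findall(r"[A-Z]+").str.join("-").map(lookup)


def with_lambda_merge():
    # apply + re.sub (same pattern as the timed cell), then a merge on color.
    color = df["product"].apply(lambda x: re.sub(r"^[^ALPHA:]*-(.*)", r"\1", x))
    return pd.merge(df.assign(color=color), df1, on="color")


for fn in (with_map, with_lambda_merge):
    # Best of 3 repeats, 100 calls each, reported per call.
    best = min(timeit.repeat(fn, number=100, repeat=3)) / 100
    print(f"{fn.__name__}: {best * 1e3:.3f} ms per loop")
```

With only four rows the timings mostly measure fixed overhead, so it is worth repeating the comparison on a frame closer to your real size.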
Updated timings
In [7]:
%%timeit
df1['color'] = df1['product'].str.extract(r'-([A-Z-]+)$')
pd.merge(df1, df2)
100 loops, best of 3: 4.51 ms per loop
In [9]:
%%timeit
df1['code'] = df1['product'].str.findall(r'[A-Z]+').str.join('-').map(df2.set_index('color')['code'])
100 loops, best of 3: 3.87 ms per loop
In [10]:
%%timeit
import re
df1['color'] = df1['product'].apply(lambda x: re.sub('^[^ALPHA:]*-(.*)', '\\1', x))
pd.merge(df1, df2, on='color')
100 loops, best of 3: 4.79 ms per loop
So @unutbu's answer is marginally faster than @colonel beaveau's, but using map here is faster still.
In fact, if we combine @unutbu's regex str method with map, we get something faster than my original method:
In [12]:
%%timeit
df1['product'].str.extract(r'-([A-Z-]+)$', expand=False).map(df2.set_index('color')['code'])
100 loops, best of 3: 2.17 ms per loop
So using map here is nearly 2X faster than merging.
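Here is a minimal runnable sketch of the extract-plus-map variant, using the question's naming (df1 for products, df2 for colors). One assumption to flag: I pass expand=False so that str.extract returns a Series for the single capture group, since recent pandas versions return a DataFrame by default and a DataFrame cannot be used with map in the same way. The final check confirms that each product really does end with the color it was matched to.

```python
import pandas as pd

# Frames rebuilt from the question.
df1 = pd.DataFrame(
    {"product": ["a208-BLACK", "a2008-WHITE", "x307-PEARL-WHITE", "aa-b307-WHITE"]}
)
df2 = pd.DataFrame(
    {"color": ["BLACK", "WHITE", "PEARL-WHITE"],
     "code": ["X1001", "X7005", "X7055"]}
)

# Capture the trailing run of uppercase letters and dashes after a dash.
# expand=False returns a Series rather than a one-column DataFrame.
color = df1["product"].str.extract(r"-([A-Z-]+)$", expand=False)

# Look the extracted color up in a Series indexed by color.
df1["code"] = color.map(df2.set_index("color")["code"])

# Sanity check: every product ends with the color it was matched to.
ends_ok = [p.endswith(c) for p, c in zip(df1["product"], color)]

print(df1)
print(all(ends_ok))
```

Because the pattern is anchored with $, it only looks at the end of the string, which is closer to the endswith semantics asked for in the question.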