Pandas: how to merge 2 dataframes on key1.str.endswith(key2)

Question

I want to find the best way to merge 2 dataframes on key1.str.endswith(key2). An example is sometimes better than words:

I want to merge df1 and df2 on product.str.endswith(color)

 df1:
    index product
    1     a208-BLACK 
    2     a2008-WHITE
    3     x307-PEARL-WHITE
    4     aa-b307-WHITE

 df2:
    index color       code
    1     BLACK       X1001
    2     WHITE       X7005
    3     PEARL-WHITE X7055

to get:

 df:
    index product            code
    1     a208-BLACK         X1001
    2     a2008-WHITE        X7005
    3     x307-PEARL-WHITE   X7055
    4     aa-b307-WHITE      X7005

Any ideas?

Recommended answer

I'm not a regex expert, and the last row was the trickiest one to handle, but the following works (here df holds the products and df1 holds the colors and codes):

In [402]:

df['code'] = df['product'].str.split('-').str[1:].str.join('-').str.findall(r'[A-Z]+').str.join('-').map(df1.set_index('color')['code'])
df
Out[402]:
                product   code
index                         
1            a208-BLACK  X1001
2           a2008-WHITE  X7005
3      x307-PEARL-WHITE  X7055
4         aa-b307-WHITE  X7005

Basically, I split the product code on - and take all the elements to the right of the first dash.

This leaves:

In [403]:

df['product'].str.split('-').str[1:]
Out[403]:
index
1               [BLACK]
2               [WHITE]
3        [PEARL, WHITE]
4         [b307, WHITE]
Name: product, dtype: object

I then put the dashes back, use a regex to find only runs of uppercase letters (this deals with the last row), and rejoin them with dashes.

The last step is to call map on the result, passing the other dataframe's code column after setting color as its index. This looks up each color value and returns the corresponding code.
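The steps described above can be sketched one at a time. This is a minimal reconstruction that uses the question's names (df1 for products, df2 for colors), not the answer's df/df1; the intermediate variable names tokens, color and lookup are introduced here for illustration only.

```python
import pandas as pd

# Frames rebuilt from the question: df1 holds products, df2 holds colors and codes
df1 = pd.DataFrame(
    {"product": ["a208-BLACK", "a2008-WHITE", "x307-PEARL-WHITE", "aa-b307-WHITE"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)
df2 = pd.DataFrame(
    {"color": ["BLACK", "WHITE", "PEARL-WHITE"], "code": ["X1001", "X7005", "X7055"]},
    index=pd.Index([1, 2, 3], name="index"),
)

# Step 1: collect every run of uppercase letters in each product string
tokens = df1["product"].str.findall(r"[A-Z]+")

# Step 2: rejoin the runs with dashes to rebuild the color name
color = tokens.str.join("-")

# Step 3: build a Series indexed by color and look each color up in it
lookup = df2.set_index("color")["code"]
df1["code"] = color.map(lookup)

print(df1)
```

Keeping the lookup as a Series indexed by color is what lets map act as a dictionary-style lookup instead of a full merge.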

The regex isn't foolproof, but it works for your dataset.
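One way the uppercase-only regex can break, as a small sketch: if the model part of the product string itself contains an uppercase letter, that letter ends up in the rebuilt color and the lookup misses. The product "A208-BLACK" below is a made-up value, not part of the original data.

```python
import pandas as pd

# Color-to-code lookup taken from the question's df2
lookup = pd.Series({"BLACK": "X1001", "WHITE": "X7005", "PEARL-WHITE": "X7055"})

# "A208-BLACK" is a hypothetical product id with an uppercase model prefix
products = pd.Series(["a208-BLACK", "A208-BLACK"])

# The leading "A" is picked up as its own uppercase run
color = products.str.findall(r"[A-Z]+").str.join("-")

# The second row becomes "A-BLACK", which is not a known color, so map gives NaN
codes = color.map(lookup)
print(pd.DataFrame({"product": products, "color": color, "code": codes}))
```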

Edit

I now realise we don't need that many joins:

In [409]:

df['code'] = df['product'].str.findall(r'[A-Z]+').str.join('-').map(df1.set_index('color')['code'])
df
Out[409]:
                product   code
index                         
1            a208-BLACK  X1001
2           a2008-WHITE  X7005
3      x307-PEARL-WHITE  X7055
4         aa-b307-WHITE  X7005

Timings

In [414]:


%%timeit 
import re
df['color'] = df['product'].apply(lambda x: re.sub('^[^ALPHA:]*-(.*)', '\\1', x))

pd.merge(df, df1, on='color')
1 loops, best of 3: 4.09 ms per loop
In [416]:

%%timeit
df['code'] = df['product'].str.findall(r'[A-Z]+').str.join('-').map(df1.set_index('color')['code'])

100 loops, best of 3: 1.63 ms per loop

The str method is over 2X faster than using the lambda. This may not be so surprising, as the str methods are vectorised, as is calling map.

Updated timings

In [7]:

%%timeit
df1['color'] = df1['product'].str.extract(r'-([A-Z-]+)$')
pd.merge(df1, df2)
100 loops, best of 3: 4.51 ms per loop
In [9]:

%%timeit
df1['code'] = df1['product'].str.findall(r'[A-Z]+').str.join('-').map(df2.set_index('color')['code'])
100 loops, best of 3: 3.87 ms per loop
In [10]:

%%timeit 
import re
df1['color'] = df1['product'].apply(lambda x: re.sub('^[^ALPHA:]*-(.*)', '\\1', x))

pd.merge(df1, df2, on='color')
100 loops, best of 3: 4.79 ms per loop

So @unutbu's answer is marginally faster than @colonel beaveau's, but using map here is faster still.

In fact, if we combine @unutbu's regex str method with map, we get faster than my original method:

In [12]:

%%timeit
# expand=False makes str.extract return a Series, which .map needs
df1['product'].str.extract(r'-([A-Z-]+)$', expand=False).map(df2.set_index('color')['code'])
100 loops, best of 3: 2.17 ms per loop
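For reference, here is a self-contained sketch of the extract-then-map combination on the question's data. It assumes a recent pandas, where str.extract returns a DataFrame unless expand=False is passed; the intermediate name color is introduced here for illustration.

```python
import pandas as pd

# Frames rebuilt from the question: df1 holds products, df2 holds colors and codes
df1 = pd.DataFrame(
    {"product": ["a208-BLACK", "a2008-WHITE", "x307-PEARL-WHITE", "aa-b307-WHITE"]}
)
df2 = pd.DataFrame(
    {"color": ["BLACK", "WHITE", "PEARL-WHITE"], "code": ["X1001", "X7005", "X7055"]}
)

# Capture the trailing run of uppercase letters and dashes that follows a dash.
# expand=False returns a Series rather than a one-column DataFrame.
color = df1["product"].str.extract(r"-([A-Z-]+)$", expand=False)

# Look each extracted color up in a Series indexed by color
df1["code"] = color.map(df2.set_index("color")["code"])

print(df1)
```

Because the pattern is anchored with $, the lowercase segment in "aa-b307-WHITE" is skipped and only "WHITE" is captured.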

So using map here is roughly 2X faster than merging (2.17 ms versus 4.51 ms).
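None of the approaches above calls endswith directly; they all rebuild the color with a regex first. If a literal suffix match is wanted, one possible sketch is below. It is not taken from any of the answers: the helper match_color and the longest-first ordering are assumptions made here so that "PEARL-WHITE" is preferred over "WHITE".

```python
import pandas as pd

# Frames rebuilt from the question: df1 holds products, df2 holds colors and codes
df1 = pd.DataFrame(
    {"product": ["a208-BLACK", "a2008-WHITE", "x307-PEARL-WHITE", "aa-b307-WHITE"]}
)
df2 = pd.DataFrame(
    {"color": ["BLACK", "WHITE", "PEARL-WHITE"], "code": ["X1001", "X7005", "X7055"]}
)

# Try longer colors first so the most specific suffix wins
colors = sorted(df2["color"], key=len, reverse=True)


def match_color(product):
    """Return the longest color that the product string ends with, or None."""
    return next((c for c in colors if product.endswith(c)), None)


# Hypothetical helper applied row by row, then an ordinary left merge on color
df1["color"] = df1["product"].map(match_color)
merged = df1.merge(df2, on="color", how="left")

print(merged)
```

This runs a Python-level loop over the colors for every product, so it is unlikely to match the str-method versions for speed, but it does not depend on the colors being uppercase.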
