Python Pandas - Merge based on substring in string

Question

I have 2 dataframes with the following format:

df_search

SEARCH
part1
anotherpart
onemorepart


df_all

FILE            EXTENSION   PATH
part1_1         .prt        //server/folder1/part1_1
part1_2         .prt        //server/folder2/part1_2
part1_2         .pdf        //server/folder3/part1_2
part1_3         .prt        //server/folder2/part1_3
anotherpart_1   .prt        //server/folder1/anotherpart_1
anotherpart_2   .prt        //server/folder3/anotherpart_2
anotherpart_3   .prt        //server/folder2/anotherpart_3
anotherpart_3   .cgm        //server/folder1/anotherpart_3
anotherpart_4   .prt        //server/folder3/anotherpart_4
onemorepart_1   .prt        //server/folder2/onemorepart_1
onemorepart_2   .prt        //server/folder1/onemorepart_2
onemorepart_2   .dwg        //server/folder2/onemorepart_2
onemorepart_3   .prt        //server/folder1/onemorepart_3
onemorepart_4   .prt        //server/folder1/onemorepart_4

The full df_search has 15,000 items. df_all has 550,000 items. I am trying to merge the two dataframes based on the search item string being in the file string. My desired output is this:

SEARCH        FILE            EXTENSION   PATH
part1         part1_1         .prt        //server/folder1/part1_1
part1         part1_2         .prt        //server/folder2/part1_2
part1         part1_2         .pdf        //server/folder3/part1_2
part1         part1_3         .prt        //server/folder2/part1_3
anotherpart   anotherpart_1   .prt        //server/folder1/anotherpart_1
anotherpart   anotherpart_2   .prt        //server/folder3/anotherpart_2
anotherpart   anotherpart_3   .prt        //server/folder2/anotherpart_3
anotherpart   anotherpart_3   .cgm        //server/folder1/anotherpart_3
anotherpart   anotherpart_4   .prt        //server/folder3/anotherpart_4
onemorepart   onemorepart_1   .prt        //server/folder2/onemorepart_1
onemorepart   onemorepart_2   .prt        //server/folder1/onemorepart_2
onemorepart   onemorepart_2   .dwg        //server/folder2/onemorepart_2
onemorepart   onemorepart_3   .prt        //server/folder1/onemorepart_3
onemorepart   onemorepart_4   .prt        //server/folder1/onemorepart_4

A simple dataframe merge does not work, because the strings are never exact matches (the search term is always a substring of the file name). I also tried the following method, based on other questions here on Stack Overflow:

df_all[df_all['FILE'].str.contains('|'.join(df_search['SEARCH']))]

This gave me a full list of all the found items in df_all, but I don't know which search string returned which result.
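For illustration, the same alternation pattern can also report which term matched if it is wrapped in a capture group and passed to `str.extract` instead of `str.contains`. The following is only a minimal sketch: it assumes the column names `FILE` and `SEARCH` from the tables above and uses a small stand-in data set (including a hypothetical `unrelated_1` row) rather than the real 550,000-row frame.

```python
import re

import pandas as pd

# Small stand-in data; the "unrelated_1" row is hypothetical.
df_search = pd.DataFrame({"SEARCH": ["part1", "anotherpart", "onemorepart"]})
df_all = pd.DataFrame({
    "FILE": ["part1_1", "part1_2", "anotherpart_3", "onemorepart_2", "unrelated_1"],
    "EXTENSION": [".prt", ".pdf", ".cgm", ".dwg", ".prt"],
})

# Escape each term so regex metacharacters are matched literally, and list
# longer terms first so they are tried before shorter ones at the same position.
terms = sorted(df_search["SEARCH"], key=len, reverse=True)
pattern = "(" + "|".join(re.escape(t) for t in terms) + ")"

# str.extract returns the captured term for each row (NaN where nothing matched).
labelled = df_all.copy()
labelled["SEARCH"] = labelled["FILE"].str.extract(pattern, expand=False)
labelled = labelled.dropna(subset=["SEARCH"]).reset_index(drop=True)

print(labelled)
```

Note that `str.extract` keeps only the first match per row, so a file name containing several search terms is labelled with just one of them, whereas a per-term loop would report it once for each term.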

I managed to get it to work with a for loop, but it is slow (67 minutes) with my dataset:

import pandas as pd

super_df = []
for search_item in df_search['SEARCH']:
    # Select the rows of df_all whose FILE contains this search term
    temp_df = df_all[df_all['FILE'].str.contains(search_item)].copy()
    # Record which search term produced these rows
    temp_df['SEARCH'] = search_item
    super_df.append(temp_df)
super_df = pd.concat(super_df, axis=0, ignore_index=True)

Is it possible to do this with vectorisation to improve performance?

Thanks

Recommended answer

Use
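The rest of the answer is not reproduced on this page. As an illustration only, and not necessarily the method the answer recommended, here is one way to avoid the per-term loop: derive a join key from `FILE` and use an ordinary `merge`. This sketch rests on a strong assumption taken from the sample data, namely that every file name is a search term followed by an underscore and a suffix. The stand-in frames and the `unrelated_1` row are hypothetical.

```python
import pandas as pd

# Small stand-in data; the "unrelated_1" row is hypothetical.
df_search = pd.DataFrame({"SEARCH": ["part1", "anotherpart", "onemorepart"]})
df_all = pd.DataFrame({
    "FILE": ["part1_1", "part1_2", "anotherpart_3", "onemorepart_2", "unrelated_1"],
    "EXTENSION": [".prt", ".pdf", ".cgm", ".dwg", ".prt"],
    "PATH": [
        "//server/folder1/part1_1",
        "//server/folder3/part1_2",
        "//server/folder1/anotherpart_3",
        "//server/folder2/onemorepart_2",
        "//server/folder1/unrelated_1",
    ],
})

# ASSUMPTION: the search term is everything before the last underscore in FILE.
keyed = df_all.assign(SEARCH=df_all["FILE"].str.rsplit("_", n=1).str[0])

# With an exact key on both sides, a regular hash-based inner merge applies.
merged = df_search.merge(keyed, on="SEARCH", how="inner")

print(merged)
```

The design choice here is to turn the substring problem into an exact-key join, which avoids scanning all file names once per search term. If file names do not follow the assumed convention, this key derivation will miss matches and a pattern-based approach would be needed instead.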
