Return Column with list of Keywords present in String Column - Pandas
Question
I have a list of keywords and a DataFrame:
keywords=['chair','table', 'fan']
Description
The table is 6 inches long
The fan is really good
The table fan is cheap
The chair is broken
The chair is on the table
I want to search the list of keywords and create a new column showing which keywords from the list are present in the Description column.
Description                  Keyword
The table is 6 inches long   table
The fan is really good       fan
The table fan is cheap       table, fan
The chair is broken          chair
The chair is on the table    chair, table
I have searched for a few solutions, but none of them seem to work. I tried the following code on my own:
for i in word_set:
    for x in range(0, len(df)):
        if(df['Event Message'][x] in (i)):
            df['word'] = i
But obviously the time complexity is too high and it takes a lot of time. Any help would be appreciated.
Recommended answer
Use Series.str.findall with Series.str.join, combining the keywords into one pattern with the regex OR operator, |:
keywords=['chair','table', 'fan']
df['Keyword'] = df['Description'].str.findall('|'.join(keywords)).apply(set).str.join(', ')
print (df)
   Description                 Keyword
0  The table is 6 inches long  table
1  The fan is really good      fan
2  The table fan is cheap      table, fan
3  The chair is broken         chair
4  The chair is on the table   chair, table
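Note that a set has no guaranteed iteration order, so the keywords within a cell may come out in a different order than shown. The following is a minimal sketch, assuming you want the keywords in order of first appearance and that a keyword might contain regex metacharacters. The helper name `ordered_unique` is hypothetical and not part of pandas:

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Description": [
        "The table is 6 inches long",
        "The fan is really good",
        "The table fan is cheap",
        "The chair is broken",
        "The chair is on the table",
    ]
})
keywords = ["chair", "table", "fan"]

# Escape each keyword so characters such as "." or "+" are matched literally.
pattern = "|".join(re.escape(k) for k in keywords)


def ordered_unique(matches):
    """Hypothetical helper: drop duplicates but keep first-appearance order."""
    return ", ".join(dict.fromkeys(matches))


df["Keyword"] = df["Description"].str.findall(pattern).apply(ordered_unique)
print(df)
```

`dict.fromkeys` removes duplicates like `set` does, but it keeps insertion order, so the result is stable between runs.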
If you need word boundaries to avoid matching substrings:
keywords=['chair','tab', 'fan']
pat = '|'.join(r"\b{}\b".format(x) for x in keywords)
df['Keyword1'] = df['Description'].str.findall(pat).apply(set).str.join(', ')
df['Keyword2'] = df['Description'].str.findall('|'.join(keywords)).apply(set).str.join(', ')
print (df)
   Description                 Keyword1  Keyword2
0  The table is 6 inches long            tab
1  The fan is really good      fan       fan
2  The table fan is cheap      fan       tab, fan
3  The chair is broken         chair     chair
4  The chair is on the table   chair     chair, tab
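As a variation on the word-boundary pattern, the alternatives can be wrapped in a single non-capturing group so that `\b` is written only once. This is a sketch under the same sample data, with `re.escape` added as an assumption in case a keyword contains special characters:

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Description": [
        "The table is 6 inches long",
        "The fan is really good",
        "The table fan is cheap",
        "The chair is broken",
        "The chair is on the table",
    ]
})
keywords = ["chair", "tab", "fan"]

# One pair of word boundaries around a non-capturing group of alternatives.
alternatives = "|".join(re.escape(k) for k in keywords)
pat = r"\b(?:{})\b".format(alternatives)

# dict.fromkeys removes duplicates while keeping first-appearance order.
df["Keyword1"] = (
    df["Description"]
    .str.findall(pat)
    .apply(lambda matches: ", ".join(dict.fromkeys(matches)))
)
print(df)
```

With the boundaries in place, "tab" no longer matches inside "table", so rows that mention only a table get an empty string.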
To improve performance, you can use a custom function that splits each string and tests membership in a set:
keywords=['chair','table', 'fan']
s = set(keywords)
f = lambda x: ', '.join(set([y for y in x.split() if y in s]))
df['Keyword1'] = df['Description'].apply(f)
A list comprehension should also be faster:
df['Keyword1'] = [', '.join(set([y for y in x.split() if y in s])) for x in df['Description']]
print (df)
   Description                 Keyword1
0  The table is 6 inches long  table
1  The fan is really good      fan
2  The table fan is cheap      fan, table
3  The chair is broken         chair
4  The chair is on the table   table, chair
Thank you, @Henry Yik, for another solution with set.intersection:
df['Keyword1'] = df['Description'].apply(lambda x: ', '.join(set(x.split()).intersection(s)))
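One caveat of the split-based approaches is that they compare whole whitespace-separated tokens, so "table," or "Table" would not match the keyword "table". A minimal sketch of one way to handle that, assuming case-insensitive matching and stripped punctuation are wanted; the function name `find_keywords` and the two sample rows are made up for illustration:

```python
import string

import pandas as pd

# Made-up rows with capitalization and punctuation attached to the keywords.
df = pd.DataFrame({
    "Description": [
        "The Table, fan is cheap.",
        "The chair is on the table!",
    ]
})
s = {"chair", "table", "fan"}


def find_keywords(text):
    """Hypothetical helper: normalize tokens, then test membership in the set."""
    # Lowercase each token and strip punctuation from both ends.
    tokens = (t.strip(string.punctuation).lower() for t in text.split())
    # dict.fromkeys removes duplicates while keeping first-appearance order.
    return ", ".join(dict.fromkeys(t for t in tokens if t in s))


df["Keyword1"] = df["Description"].apply(find_keywords)
print(df)
```

The extra normalization costs some speed compared with a plain `split`, so it is only worth adding if the data actually contains such variations.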