Creating a new column by finding an exact word in a column of strings
Question
I want to create a new column containing 1 or 0, depending on whether any of the words in a list exactly matches a word in the dataframe's string column.
list_provided=["mul","the"]
#how my dataframe looks
id  text
a   simultaneous there the
b   simultaneous there
c   mul why
Expected output
id  text                    found
a   simultaneous there the  1
b   simultaneous there      0
c   mul why                 1
The second row is assigned 0 because neither "mul" nor "the" matches a whole word in the string column "text".
Code attempted so far
#For exact match I am using the below code
data["Found"]=np.where(data["text"].str.contains(r'(?:\s|^)penalidades(?:\s|$)'),1,0)
How can I loop over the provided list and check for an exact match of any of its words?
Edit: If I use str.contains(pattern) as suggested by Georgey, every row of data["Found"] becomes 1.
import numpy as np
import pandas as pd
data=pd.DataFrame({"id":("a","b","c","d"), "text":("simultaneous there the","simultaneous there","mul why","mul")})
list_of_word=["mul","the"]
pattern = '|'.join(list_of_word)
data["Found"]=np.where(data["text"].str.contains(pattern),1,0)
Output:
id  text                    found
a   simultaneous there the  1
b   simultaneous there      1
c   mul why                 1
d   mul                     1
The second row in the found column should be 0 here.
Recommended answer
You can do this with pd.Series.apply, using sum over a generator expression that tests each whitespace-separated word for membership in a set:
import pandas as pd
df = pd.DataFrame({'id': ['a', 'b', 'c'],
                   'text': ['simultaneous there the', 'simultaneous there', 'mul why']})
test_set = {'mul', 'the'}
df['found'] = df['text'].apply(lambda x: sum(i in test_set for i in x.split()))
# id text found
# 0 a simultaneous there the 1
# 1 b simultaneous there 0
# 2 c mul why 1
The above gives a count of matching words per row. If you only need a Boolean, use any:
df['found'] = df['text'].apply(lambda x: any(i in test_set for i in x.split()))
For an integer representation, chain .astype(int).
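For comparison, here is a minimal sketch that runs the any-based approach on the four-row frame from the question's edit, next to a regex variant of the str.contains attempt. The unanchored pattern 'mul|the' matches substrings, so "the" inside "there" and "mul" inside "simultaneous" both count as hits, which is why every row became 1. Wrapping the alternatives in \b word boundaries is one way to avoid that; the column name found_regex and the use of re.escape are illustrative choices, not part of the original answer.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'text': ['simultaneous there the', 'simultaneous there', 'mul why', 'mul'],
})
words = ['mul', 'the']

# Split-based approach from the answer, cast to 0/1.
test_set = set(words)
df['found'] = df['text'].apply(
    lambda x: any(w in test_set for w in x.split())
).astype(int)

# Regex variant (an assumption, not from the answer): \b anchors each
# alternative at word boundaries, and re.escape guards against regex
# metacharacters in the word list. The group is non-capturing so that
# str.contains does not warn about match groups.
pattern = r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'
df['found_regex'] = df['text'].str.contains(pattern).astype(int)

print(df)
```

The two columns agree on this data, but they are not equivalent in general: split() only separates on whitespace, so "the," would not match "the", whereas \b treats punctuation as a boundary and would match it.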