Creating a new column by finding an exact word in a column of strings


Question

I want to create a new column containing 1 or 0, depending on whether any word in a list exactly matches a word in the DataFrame's string column.

list_provided=["mul","the"]
#how my dataframe looks
id  text
a    simultaneous there the
b    simultaneous there
c    mul why

Expected output

id  text                     found
a    simultaneous there the   1
b    simultaneous there       0
c    mul why                  1

The second row is assigned 0 because neither "mul" nor "the" appears as an exact word in the string column "text".

Code attempted so far

#For exact match I am using the below code
data["Found"]=np.where(data["text"].str.contains(r'(?:\s|^)penalidades(?:\s|$)'),1,0)

How can I loop over the provided list of words and find an exact match for any of them?

Edit: If I use str.contains(pattern) as suggested by Georgey, every row of data["Found"] becomes 1:

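import numpy as np
import pandas as pd

data=pd.DataFrame({"id":("a","b","c","d"), "text":("simultaneous there the","simultaneous there","mul why","mul")})
list_of_word=["mul","the"]
# 'mul|the' matches substrings, so "simultaneous" and "there" also count as hits
pattern = '|'.join(list_of_word)
data["Found"]=np.where(data["text"].str.contains(pattern),1,0)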

Output:
id  text                     found
a    simultaneous there the   1
b    simultaneous there       1
c    mul why                  1
d    mul                      1

The second row of the found column should be 0 here.

Recommended answer

You can do this with pd.Series.apply and sum with a generator expression:

import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c'],
                   'text': ['simultaneous there the', 'simultaneous there', 'mul why']})

test_set = {'mul', 'the'}

df['found'] = df['text'].apply(lambda x: sum(i in test_set for i in x.split()))

#   id                    text  found
# 0  a  simultaneous there the      1
# 1  b      simultaneous there      0
# 2  c                 mul why      1


The above provides a count of matching words. If you just need a Boolean, use any:

df['found'] = df['text'].apply(lambda x: any(i in test_set for i in x.split()))
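To illustrate the difference between the count and the Boolean, here is a minimal sketch. The two rows are hypothetical and were chosen so that one text contains several matching words; they are not from the question.

```python
import pandas as pd

# Hypothetical rows: the first text contains three matching tokens.
df = pd.DataFrame({'id': ['a', 'b'],
                   'text': ['the mul the', 'simultaneous there']})
test_set = {'mul', 'the'}

# sum counts every whitespace-separated token found in test_set,
# so the result can be greater than 1.
counts = df['text'].apply(lambda x: sum(i in test_set for i in x.split()))

# any stops at the first matching token and yields True or False.
flags = df['text'].apply(lambda x: any(i in test_set for i in x.split()))

print(counts.tolist())  # [3, 0]
print(flags.tolist())   # [True, False]
```

With the sample data in the question, no row has more than one matching word, which is why the count there already looks like a 0/1 flag.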

For an integer representation, chain .astype(int).
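Putting the pieces together on the four-row frame from the question's edit might look like the sketch below. The second column, found_regex, is an assumption on my part rather than part of the answer: it keeps the str.contains approach from the question but wraps the alternation in \b word boundaries and escapes each word with re.escape, so that "mul" no longer matches inside "simultaneous".

```python
import re
import pandas as pd

data = pd.DataFrame({'id': ['a', 'b', 'c', 'd'],
                     'text': ['simultaneous there the', 'simultaneous there',
                              'mul why', 'mul']})
list_of_word = ['mul', 'the']

# Token-based check from the answer, converted to 0/1 with astype(int).
test_set = set(list_of_word)
data['found'] = (data['text']
                 .apply(lambda x: any(i in test_set for i in x.split()))
                 .astype(int))

# Regex alternative (an assumption, not from the answer): match whole
# words only. The non-capturing group keeps the alternation inside the
# boundaries, and re.escape protects words with regex metacharacters.
pattern = r'\b(?:' + '|'.join(map(re.escape, list_of_word)) + r')\b'
data['found_regex'] = data['text'].str.contains(pattern).astype(int)

print(data['found'].tolist())        # [1, 0, 1, 1]
print(data['found_regex'].tolist())  # [1, 0, 1, 1]
```

The two approaches agree here but are not equivalent in general: x.split() separates on whitespace only, so "the," would not match the token check, whereas \b treats punctuation as a boundary and would match.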
