Pandas: matching column contents to keywords (with spaces and brackets)

Problem description

A column in my data frame contains the keywords I want to match.

I want to check whether each entry in the column contains any of the keywords. If it does, print them.

I tried the following:

import pandas as pd
import re

Keywords = [

"Caden(S, A)",
"Caden(a",
"Caden(.A))",
"Caden.Q",
"Caden.K",
"Caden"
]

data = {'People' : ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]}

df = pd.DataFrame(data)

pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)

df["found"] = df['People'].str.findall(pat).str.join('; ')

print(df["found"])

It returns NaN. I suspect the problem lies in the spaces and brackets in the keywords.

What is the right way to get the desired output? Thank you.

Caden(S, A); Caden.K
Caden(a
Caden(.A))
Caden.Q

Recommended answer

Since you do not need to find every keyword, only the longest one when keywords overlap, you can use a regex with a findall approach.
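As a minimal sketch of why the order of alternatives matters (the two-keyword patterns and the variable names below are illustrative, not taken from the question), regex alternation takes the first alternative that matches, so a short keyword listed first can hide a longer one that starts the same way:

```python
import re

text = "Caden.K"

# Shorter alternative first: "Caden" matches and is followed by ".",
# which is not a word character, so the engine stops there.
short_first = re.findall(r"(?<!\w)(?:Caden|Caden\.K)(?!\w)", text)

# Longer alternative first: the full keyword is tried before its prefix.
long_first = re.findall(r"(?<!\w)(?:Caden\.K|Caden)(?!\w)", text)

print(short_first)  # ['Caden']
print(long_first)   # ['Caden.K']
```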

There are three points here. First, sort the keywords by length in descending order, because alternation tries the alternatives from left to right and some keywords are prefixes of longer ones (and some contain whitespace). Second, escape the values with re.escape, as they contain special regex characters. Third, replace \b with the unambiguous word boundaries (?<!\w) and (?!\w). Note that \b is context dependent: it only matches between a word character and a non-word character, so it fails after a keyword that ends with a bracket when a space or the end of the string follows.
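The context dependence of \b can be checked directly. The snippet below is a small sketch that uses one string and one keyword from the question; the variable names are chosen only for illustration:

```python
import re

text = "Grayson.Q, Lily; Caden(.A))"
keyword = re.escape("Caden(.A))")

# The trailing \b sits between ")" and the end of the string. Neither side
# is a word character, so there is no word boundary and the match fails.
with_b = re.findall(r"\b{}\b".format(keyword), text)

# (?!\w) only requires that no word character follows, so it succeeds.
with_lookarounds = re.findall(r"(?<!\w){}(?!\w)".format(keyword), text)

print(with_b)            # []
print(with_lookarounds)  # ['Caden(.A))']
```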

Use

pat = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))

See the Python demo:

import re
Keywords = ["Caden(S, A)", "Caden(a","Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
rx = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
# => (?<!\w)(?:Caden\(S,\ A\)|Caden\(\.A\)\)|Caden\(a|Caden\.Q|Caden\.K|Caden)(?!\w)
strs = ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]
for s in strs:
    print(re.findall(rx, s))

Output:

['Caden(S, A)', 'Caden.K']
['Caden(a']
['Caden(.A))']
['Caden.Q']
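The demo above uses plain re. A sketch of the same pattern applied to the data frame from the question might look like this, assuming pandas is installed and that Series.str.findall passes the pattern to re.findall as documented; the lowercase variable names are my own:

```python
import re
import pandas as pd

keywords = ["Caden(S, A)", "Caden(a", "Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
data = {"People": ["Caden(S, A) Charlotte.A, Caden.K;",
                   "Emily.P Ethan.B; Caden(a",
                   "Grayson.Q, Lily; Caden(.A))",
                   "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]}
df = pd.DataFrame(data)

# Longest keywords first, each one escaped, wrapped in lookaround boundaries.
escaped = map(re.escape, sorted(keywords, key=len, reverse=True))
pat = r"(?<!\w)(?:{})(?!\w)".format("|".join(escaped))

# findall returns a list of matches per row; join turns it into one string.
df["found"] = df["People"].str.findall(pat).str.join("; ")
print(df["found"].tolist())
# ['Caden(S, A); Caden.K', 'Caden(a', 'Caden(.A))', 'Caden.Q']
```

The alternation is wrapped in a non-capturing group, (?:...), so findall returns the whole match rather than a captured group.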
