How to check if Pandas rows contain any full string or substring of a list?
Question
I have a list of strings:
list_ = ['abc', 'def', 'xyz']
And I have a df with a column CheckCol. I want to check whether each value in CheckCol contains any element of the list as a whole substring.
If it does, I want to copy the original value into a new column NewCol.
CheckCol
'a'
'ab'
'abc'
'abc-de'
into
# What I want
CheckCol NewCol
'a'
'ab'
'abc' 'abc'
'abc-de' 'abc-de'
My code below, however, only recognizes exact full-string matches, not the substrings I am looking for.
df['NewCol'] = np.where(df['CheckCol'].isin(list_), df['CheckCol'], '')
and gives
# What I get
CheckCol NewCol
'a'
'ab'
'abc' 'abc'
'abc-de'
Edit: changed list to list_.
Recommended answer
I think the "easiest" solution to implement is a regular expression. In regex, the pipe | means "or", so '|'.join(yourlist) builds a single pattern that matches any of the substrings we want to check.
import pandas as pd
import numpy as np
list_ = ['abc', 'def', 'xyz']
df = pd.DataFrame({
'CheckCol': ['a','ab','abc','abd-def']
})
df['NewCol'] = np.where(df['CheckCol'].str.contains('|'.join(list_)), df['CheckCol'], '')
print(df)
# CheckCol NewCol
#0 a
#1 ab
#2 abc abc
#3 abd-def abd-def
NOTE: Your variable name list was changed to list_. Try to avoid shadowing Python built-in names such as list.
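One caveat with joining the list into a regex: if an element contains regex metacharacters (for example . or +), it is interpreted as a pattern, not as literal text, and missing values in the column produce NaN in the mask. The sketch below is one possible way to handle both, assuming literal matching is what you want. The list and DataFrame contents here are hypothetical and chosen only to illustrate the point.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical list: both elements contain regex metacharacters
list_ = ['a.c', 'x+y']

# Hypothetical data, including a missing value
df = pd.DataFrame({'CheckCol': ['abc', 'a.c-1', 'x+y', None]})

# Escape each element so it is matched literally, then join with the regex "or"
pattern = '|'.join(re.escape(s) for s in list_)

# na=False treats missing values as "no match" instead of propagating NaN
mask = df['CheckCol'].str.contains(pattern, na=False)

# Copy the original value where the mask is True, otherwise use an empty string
df['NewCol'] = np.where(mask, df['CheckCol'], '')
print(df)
```

Without re.escape, the unescaped pattern a.c would also match 'abc', because . matches any character.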