Search for a partial string match in a data frame column from a list - Pandas - Python


Question

I have a list:

things = ['A1','B2','C3']

I have a pandas data frame with a column containing values separated by semicolons. Some of the rows will contain a match with one of the items in the list above. It won't be a perfect match, since the string in the column has other parts as well; for example, a row in that column may have 'Wow;Here;This=A1;10001;0'.

I want to save the rows that contain a match with items from the list, and then create a new data frame with those selected rows (it should have the same headers). This is what I tried:

import re

for_new_df =[]

for x in df['COLUMN']:
    for mp in things:
        if df[df['COLUMN'].str.contains(mp)]:
            for_new_df.append(mp)  #This won't save the whole row - help here too, please.

This code gave me an error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'm very new to coding, so the more explanation and detail in your answer, the better! Thanks in advance.

Recommended answer

You can avoid the loop by joining your list of words into a regex and using str.contains:

pat = '|'.join(things)  # builds the regex 'A1|B2|C3'
for_new_df = df[df['COLUMN'].str.contains(pat)]

This should work.

So the regex pattern becomes 'A1|B2|C3', and it will match any string in the column that contains any of these substrings, anywhere in the string.
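
Because the joined pattern is interpreted as a regex, two edge cases may be worth guarding against: list items that contain regex metacharacters, and missing values in the column. The following is a minimal sketch, assuming a hypothetical sample frame with one missing value; re.escape and the na=False argument are additions that are not part of the original answer.

```python
import re

import pandas as pd

things = ['A1', 'B2', 'C3']

# Hypothetical sample data; the None row stands in for a missing value.
df = pd.DataFrame({'COLUMN': ['Wow;Here;This=A1;10001;0', 'B2', None, 'asdas']})

# re.escape keeps characters such as '.' or '+' literal if a list item contains them.
pat = '|'.join(re.escape(t) for t in things)

# na=False treats missing values as non-matches, so the mask is purely boolean.
mask = df['COLUMN'].str.contains(pat, na=False)
for_new_df = df[mask]
```

For plain alphanumeric items such as 'A1', escaping leaves the pattern unchanged, so this behaves the same as the simpler join.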

Example:

In [65]:
import pandas as pd
things = ['A1','B2','C3']
pat = '|'.join(things)
df = pd.DataFrame({'a':['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas']})
df[df['a'].str.contains(pat)]

Out[65]:
                          a
0  Wow;Here;This=A1;10001;0
1                        B2
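
Since the question asks for a new data frame with the same headers, here is a small sketch showing that boolean indexing keeps every column of the matching rows. The 'id' column is a hypothetical addition for illustration, and the .copy() and reset_index(drop=True) calls are optional extras, not part of the original answer.

```python
import pandas as pd

things = ['A1', 'B2', 'C3']

# Hypothetical two-column frame; 'id' is an assumed extra column.
df = pd.DataFrame({
    'id': [10, 11, 12, 13],
    'COLUMN': ['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas'],
})

pat = '|'.join(things)

# .copy() makes the result independent of df; reset_index renumbers rows from 0.
for_new_df = df[df['COLUMN'].str.contains(pat)].copy().reset_index(drop=True)
```

Dropping reset_index keeps the original row labels, which can be useful for tracing the selected rows back to the source frame.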

As for why your attempt failed:

if df[df['COLUMN'].str.contains(mp)]

This line:

df[df['COLUMN'].str.contains(mp)]

returns a DataFrame masked by the boolean array from your inner str.contains call. The if statement doesn't know how to evaluate a whole DataFrame (or an array of booleans) as a single truth value, hence the error. If you think about it, what should it do if only one value is True, or if all but one are True? It expects a scalar, not an array-like value.
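
If you do want to keep a loop, one option is to reduce each boolean Series to a single value with .any() before testing it, or to combine the per-item masks and index once at the end. This is a sketch under the assumption that plain substring matching (regex=False) is what you want; the names matched_items and combined are made up for illustration.

```python
import pandas as pd

things = ['A1', 'B2', 'C3']
df = pd.DataFrame({'COLUMN': ['Wow;Here;This=A1;10001;0', 'B2', 'asdasda', 'asdas']})

# Reduce the boolean Series to one scalar before handing it to `if`.
matched_items = []
for mp in things:
    mask = df['COLUMN'].str.contains(mp, regex=False)
    if mask.any():  # True when at least one row contains mp
        matched_items.append(mp)

# Loop-based equivalent of the regex approach: OR the per-item masks together.
combined = pd.Series(False, index=df.index)
for mp in things:
    combined |= df['COLUMN'].str.contains(mp, regex=False)
for_new_df = df[combined]
```

The joined-regex version is shorter, but the combined mask makes it explicit that a row is kept when it matches at least one item.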
