Pandas filtering for multiple substrings in series

Question

I need to filter rows in a pandas dataframe so that a specific string column contains at least one of a list of provided substrings. The substrings may contain unusual / regex characters. The comparison should not involve regex and should be case-insensitive.

For example:

lst = ['kdSj;af-!?', r'aBC+dsfa?\-', 'sdKaJg|dksaf-*']

I currently apply the mask like this:

import numpy as np

mask = np.logical_or.reduce([df[col].str.contains(i, regex=False, case=False) for i in lst])
df = df[mask]

My dataframe is large (~1 million rows) and lst has length 100. Is there a more efficient way? For example, if the first item in lst is found, we should not have to test any subsequent strings for that row.

Recommended answer

If you're sticking to pure pandas, I think you should use regex for this task, for both performance and practicality. However, you will need to properly escape any special characters in the substrings first to ensure that they are matched literally (and not treated as regex metacharacters).

This is easy to do using re.escape:

>>> import re
>>> esc_lst = [re.escape(s) for s in lst]

These escaped substrings can then be joined using a regex pipe |. Each of the substrings can be checked against a string until one matches (or they have all been tested).

>>> pattern = '|'.join(esc_lst)
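
To make the effect of escaping concrete, here is a minimal sketch using the substrings from the question. The names literal_text, escaped_hit and unescaped_hit are illustrative only and do not appear in the original answer.

```python
import re

# Substrings from the question; the raw string keeps the backslash literal.
lst = ['kdSj;af-!?', r'aBC+dsfa?\-', 'sdKaJg|dksaf-*']

# Escape each substring so regex metacharacters (+, ?, |, *) match literally,
# then join the escaped substrings into a single alternation.
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)

# Hypothetical text that contains the second substring verbatim.
literal_text = r'xx aBC+dsfa?\- yy'

# With escaping, the literal substring is found (case-insensitively).
escaped_hit = re.search(pattern, literal_text, flags=re.IGNORECASE) is not None

# Without escaping, 'C+' means "one or more C", so the literal '+' is missed.
unescaped_hit = re.search('aBC+dsfa?', literal_text) is not None

print(escaped_hit, unescaped_hit)
```

The unescaped pattern fails here because `+` and `?` are interpreted as quantifiers rather than as characters to match.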

The masking stage then becomes a single low-level loop through the rows:

df[col].str.contains(pattern, case=False)
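
As an end-to-end illustration, the sketch below applies the escaped pattern to a small, made-up dataframe and checks that it selects the same rows as the approach from the question. The column name 'text' and the sample rows are assumptions for demonstration only.

```python
import re

import numpy as np
import pandas as pd

lst = ['kdSj;af-!?', r'aBC+dsfa?\-', 'sdKaJg|dksaf-*']

# Hypothetical sample data: rows 0 and 2 contain one of the substrings
# (in a different case); rows 1 and 3 do not.
df = pd.DataFrame({'text': [
    'has KDSJ;AF-!? inside',
    'nothing here',
    r'abc+DSFA?\- too',
    'sdKaJg only',
]})
col = 'text'

# Single regex pass over the column with the escaped alternation.
pattern = '|'.join(re.escape(s) for s in lst)
regex_mask = df[col].str.contains(pattern, case=False)

# Approach from the question: one literal pass per substring, OR-ed together.
loop_mask = np.logical_or.reduce(
    [df[col].str.contains(s, regex=False, case=False) for s in lst]
)

filtered = df[regex_mask]
print(filtered)
```

If the column may contain missing values, passing na=False to str.contains keeps the mask strictly boolean so it can be used directly for indexing.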


Here's a simple setup to get a sense of performance:

import re
from random import randint, seed

import pandas as pd

seed(321)

# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]

col = pd.Series(strings)
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)

The proposed method takes about 1 second on these 50,000 strings (so maybe up to 20 seconds for 1 million rows):

%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop

The method in the question took approximately 5 seconds using the same input data.
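
Outside IPython, where %timeit is not available, a rough comparison of the two approaches can be sketched with time.perf_counter. This is only an illustrative harness: the helper time_call and the reduced size n_strings are assumptions chosen so the script finishes quickly, and the absolute numbers will differ from those quoted here.

```python
import re
import time
from random import randint, seed

import numpy as np
import pandas as pd

seed(321)

# Smaller than the setup above so the script runs quickly (an assumption).
n_strings = 2000
lst = [''.join(chr(randint(0, 256)) for _ in range(5)) for _ in range(100)]
strings = [''.join(chr(randint(0, 256)) for _ in range(20)) for _ in range(n_strings)]
col = pd.Series(strings)
pattern = '|'.join(re.escape(s) for s in lst)


def time_call(func):
    """Return (elapsed seconds, result) for a single call of func."""
    start = time.perf_counter()
    result = func()
    return time.perf_counter() - start, result


# One regex pass with the escaped alternation.
regex_time, regex_mask = time_call(lambda: col.str.contains(pattern, case=False))

# One literal pass per substring, as in the question.
loop_time, loop_mask = time_call(
    lambda: np.logical_or.reduce(
        [col.str.contains(s, regex=False, case=False) for s in lst]
    )
)

print(f'regex alternation:      {regex_time:.4f} s')
print(f'one pass per substring: {loop_time:.4f} s')
```

A single call is a noisy measurement; repeating each call several times and taking the best result, as %timeit does, gives more stable numbers.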

It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches, then the timing will improve.
