Pandas filtering for multiple substrings in series


Problem description


I need to filter rows in a pandas dataframe so that a specific string column contains at least one of a list of provided substrings. The substrings may have unusual / regex characters. The comparison should not involve regex and is case insensitive.

For example:

lst = ['kdSj;af-!?', 'aBC+dsfa?-', 'sdKaJg|dksaf-*']

I currently apply the mask like this:

mask = np.logical_or.reduce([df[col].str.contains(i, regex=False, case=False) for i in lst])
df = df[mask]

My dataframe is large (~1mio rows) and lst has length 100. Is there a more efficient way? For example, if the first item in lst is found, we should not have to test any subsequent strings for that row.
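For reference, the approach described above can be run end to end like this (a minimal sketch; the column name `text` and the sample rows are made up for illustration):

```python
import numpy as np
import pandas as pd

lst = ['kdSj;af-!?', 'aBC+dsfa?-', 'sdKaJg|dksaf-*']

# Hypothetical sample data: only the first row contains a substring from lst.
df = pd.DataFrame({'text': ['xx kdsj;af-!? yy', 'no match here', 'plain']})

# One boolean Series per substring; regex=False means literal matching,
# case=False makes the comparison case-insensitive.
mask = np.logical_or.reduce(
    [df['text'].str.contains(s, regex=False, case=False) for s in lst]
)
filtered = df[mask]
```

Note that this builds one full boolean Series per substring before OR-ing them, which is why it cannot stop early once a row has matched.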

Solution

If you're sticking to using pure-pandas, for both performance and practicality I think you should use regex for this task. However, you will need to properly escape any special characters in the substrings first to ensure that they are matched literally (and not used as regex meta characters).

This is easy to do using re.escape:

>>> import re
>>> esc_lst = [re.escape(s) for s in lst]

These escaped substrings can then be joined using a regex pipe |. Each of the substrings can be checked against a string until one matches (or they have all been tested).

>>> pattern = '|'.join(esc_lst)

The masking stage then becomes a single low-level loop through the rows:

df[col].str.contains(pattern, case=False)
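Putting the three steps together, here is a minimal sketch (the column name `text` and the sample values are assumptions for illustration):

```python
import re
import pandas as pd

lst = ['kdSj;af-!?', 'aBC+dsfa?-', 'sdKaJg|dksaf-*']
df = pd.DataFrame({'text': ['xx ABC+DSFA?- yy', 'nothing to see', 'sdkajg|dksaf-* !']})

# Escape regex metacharacters so each substring matches literally,
# then join the escaped substrings into one alternation pattern.
pattern = '|'.join(re.escape(s) for s in lst)

# A single pass over the rows; for each string the regex engine stops
# at the first matching alternative.
mask = df['text'].str.contains(pattern, case=False)
result = df[mask]
```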


Here's a simple setup to get a sense of performance:

import re
import pandas as pd
from random import randint, seed

seed(321)

# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]

col = pd.Series(strings)
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)

The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):

%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop

The method in the question took approximately 5 seconds using the same input data.

It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches, then the timing will improve.
