Delete item from list if it contains a substring from a "blacklist"


Problem description

In Python, I'd like to remove from a list any string that contains a substring found in a so-called "blacklist".

For example, assume list A is the following:

A = [ 'cat', 'doXXXg', 'monkey', 'hoBBBrse', 'fish', 'snake']

and list B is:

B = ['XXX', 'BBB']

How can I get list C:

C = [ 'cat', 'monkey', 'fish', 'snake']

I've played around with various combinations of regular expressions and list comprehensions, but I can't seem to get it to work.

Recommended answer

You could join the blacklist into one regular expression:

import re

blacklist = re.compile('|'.join([re.escape(word) for word in B]))

then filter out the words that match:

C = [word for word in A if not blacklist.search(word)]

Each blacklist word is escaped with re.escape(), so that . and other metacharacters are matched as literal characters, and the escaped words are joined with | into a single alternation:

>>> '|'.join([re.escape(word) for word in B])
'XXX|BBB'
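To see why the escaping matters, here is a small sketch with made-up data in which a blacklist entry contains a regex metacharacter. The names `risky`, `unescaped` and `escaped` are illustrative and not part of the question. Without `re.escape`, the `.` in `a.c` matches any character, so `abc` is dropped by mistake.

```python
import re

# Hypothetical data: the blacklist entry contains '.', a regex metacharacter.
words = ['abc', 'a.c', 'xyz']
risky = ['a.c']

# Unescaped: '.' matches any character, so 'abc' is filtered out as well.
unescaped = re.compile('|'.join(risky))
without_escape = [w for w in words if not unescaped.search(w)]

# Escaped: '.' only matches a literal dot.
escaped = re.compile('|'.join(re.escape(s) for s in risky))
with_escape = [w for w in words if not escaped.search(w)]

print(without_escape)  # ['xyz']
print(with_escape)     # ['abc', 'xyz']
```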

Demo:

>>> import re
>>> A = [ 'cat', 'doXXXg', 'monkey', 'hoBBBrse', 'fish', 'snake']
>>> B = ['XXX', 'BBB']
>>> blacklist = re.compile('|'.join([re.escape(word) for word in B]))
>>> [word for word in A if not blacklist.search(word)]
['cat', 'monkey', 'fish', 'snake']
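One edge case is worth guarding against: with an empty blacklist, `'|'.join(...)` produces an empty pattern, and an empty pattern matches every string, so the comprehension would discard the whole list. A minimal sketch of a wrapper that handles this follows; the function name `remove_blacklisted` is hypothetical, not something from the question.

```python
import re

def remove_blacklisted(items, blacklist):
    """Return the items that contain none of the blacklist substrings."""
    if not blacklist:
        # An empty alternation compiles to '', which matches everything,
        # so return a copy of the input instead of filtering.
        return list(items)
    pattern = re.compile('|'.join(re.escape(s) for s in blacklist))
    return [item for item in items if not pattern.search(item)]

A = ['cat', 'doXXXg', 'monkey', 'hoBBBrse', 'fish', 'snake']
B = ['XXX', 'BBB']
print(remove_blacklisted(A, B))   # ['cat', 'monkey', 'fish', 'snake']
print(remove_blacklisted(A, []))  # all six items, unchanged
```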

This should outperform testing each blacklist entry for membership one by one, especially as the number of words in your blacklist grows:

>>> import random, re, string, timeit
>>> def regex_filter(words, blacklist):
...     # blacklist is a compiled pattern here
...     return [word for word in words if not blacklist.search(word)]
... 
>>> def any_filter(words, blacklist):
...     # blacklist is a plain list of substrings here
...     return [word for word in words if not any(bad in word for bad in blacklist)]
... 
>>> # string.ascii_letters replaces string.letters, which does not exist in Python 3
>>> words = [''.join([random.choice(string.ascii_letters) for _ in range(random.randint(3, 20))])
...          for _ in range(1000)]
>>> blacklist = [''.join([random.choice(string.ascii_letters) for _ in range(random.randint(2, 5))])
...              for _ in range(10)]
>>> timeit.timeit('any_filter(words, blacklist)', 'from __main__ import any_filter, words, blacklist', number=100000)
>>> timeit.timeit('regex_filter(words, blacklist)', "from __main__ import re, regex_filter, words, blacklist; blacklist = re.compile('|'.join([re.escape(word) for word in blacklist]))", number=100000)

The test above checks 10 random short blacklist words (2 to 5 characters) against a list of 1000 random words (3 to 20 characters long). The run reported with this answer took 0.362 s for any_filter and 0.250 s for regex_filter, so the regex was about 45% faster. Those figures were measured with function bodies that read the example lists A and B rather than their arguments, so treat them as indicative and time the comparison on your own data.
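To repeat the comparison on Python 3, where `string.letters` no longer exists, a standalone script along these lines could be used. It is only a sketch: the helper `random_word`, the fixed seed and the repeat count of 200 are assumptions chosen to keep the run short, and the timings it prints will vary by machine and Python version.

```python
import random
import re
import string
import timeit

def regex_filter(words, pattern):
    # pattern is a compiled regex built from the blacklist
    return [word for word in words if not pattern.search(word)]

def any_filter(words, blacklist):
    # blacklist is a plain list of substrings
    return [word for word in words if not any(bad in word for bad in blacklist)]

def random_word(low, high):
    # Hypothetical helper: a random ASCII word with low..high letters.
    return ''.join(random.choice(string.ascii_letters)
                   for _ in range(random.randint(low, high)))

random.seed(0)  # fixed seed so repeated runs use the same data
words = [random_word(3, 20) for _ in range(1000)]
blacklist = [random_word(2, 5) for _ in range(10)]
pattern = re.compile('|'.join(re.escape(s) for s in blacklist))

# Both approaches must agree before their timings are worth comparing.
assert regex_filter(words, pattern) == any_filter(words, blacklist)

runs = 200  # assumed repeat count; raise it for steadier numbers
print('any:  ', timeit.timeit(lambda: any_filter(words, blacklist), number=runs))
print('regex:', timeit.timeit(lambda: regex_filter(words, pattern), number=runs))
```

Checking that both filters return the same list first guards against comparing timings of functions that do different work.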
