Match text only if it contains all words from whitelist, but none from blacklist


Problem description

I guess it'll be easier to understand what I'm trying to achieve with this example:

Let's say we have this whitelist: one two three. And this blacklist: four five. Then:

  • three one two is a matching text (contains all whitelist words);
  • one three two six is a matching text (contains all whitelist words);
  • two one is not a matching text (lacks a whitelist word three);
  • one four two three is not a matching text (contains a blacklist word four).

Could anyone help me out with a regex for this case?

Solution

This is not something you'd want to use a regex for. Better do it like this (example in Python):

>>> whitelist = ["one", "two", "three"]
>>> blacklist = ["four", "five"]
>>> texts = ["three two one", "one three two six", "one two", "one two three four"]
>>> for text in texts:
...     mytext = text.split()
...     if all(word in mytext for word in whitelist) and \
...        not any(word in mytext for word in blacklist):
...         print(text)
...
three two one
one three two six
>>>
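The same check can also be phrased with sets, which may be easier to reuse. This is only a minimal sketch, not part of the original answer: the function name `matches` is hypothetical, and it assumes words are separated by whitespace and compared case-sensitively.

```python
def matches(text, whitelist, blacklist):
    """Return True if text has every whitelist word and no blacklist word."""
    # Assumption: tokens are whitespace-separated and case-sensitive.
    words = set(text.split())
    # All whitelist words present, and no overlap with the blacklist.
    return set(whitelist) <= words and words.isdisjoint(blacklist)


whitelist = ["one", "two", "three"]
blacklist = ["four", "five"]
texts = ["three two one", "one three two six", "one two", "one two three four"]

matching = [t for t in texts if matches(t, whitelist, blacklist)]
print(matching)
```

Note that `split()` does not strip punctuation, so a token such as `one,` would not count as `one` here.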

You can do it, though:

^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)(?!.*\bfour\b)(?!.*\bfive\b)

  • ^ anchors the search at the start of the string.
  • (?=...) ensures that its contents can be matched from the current position.
  • (?!...) ensures that its contents can't be matched from the current position.
  • \bone\b matches one but not lonely.

So you get:

>>> import re
>>> r = re.compile(r"^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)(?!.*\bfour\b)(?!.*\bfive\b)")
>>> for text in texts:
...     if r.match(text):
...         print(text)
...
three two one
one three two six
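If the word lists change often, the pattern could be assembled from them instead of being written by hand. The following is a rough sketch under a few assumptions: `build_pattern` is a hypothetical helper name, the words begin and end with word characters (otherwise `\b` will not behave as intended), and each text is a single line (`.` does not match newlines unless `re.DOTALL` is passed).

```python
import re


def build_pattern(whitelist, blacklist):
    """Compile a lookahead-based pattern from two word lists."""
    # One positive lookahead per required word; re.escape guards
    # against regex metacharacters inside a word.
    required = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in whitelist)
    # One negative lookahead per forbidden word.
    forbidden = "".join(rf"(?!.*\b{re.escape(w)}\b)" for w in blacklist)
    return re.compile("^" + required + forbidden)


whitelist = ["one", "two", "three"]
blacklist = ["four", "five"]
texts = ["three two one", "one three two six", "one two", "one two three four"]

pattern = build_pattern(whitelist, blacklist)
matching = [t for t in texts if pattern.match(t)]
print(matching)
```

For the lists above this should produce the same pattern string as the hand-written one.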
