In vs regular expressions for checking against a word blacklist -- estimating expected production performance


Question

I have many HTML pages in which I need to check the existence of blacklisted words. I know that the built-in in is much faster than regular expressions, but here I'm trying to compare many in checks against a single regular expression.

Since

re.match() checks for a match only at the beginning of the string

I used a regex similar to .*(word|word...) and I replaced newline characters with a space.


Code

from timeit import timeit
import re
from urllib2 import urlopen

html = urlopen('http://en.wikipedia.org/wiki/Main_Page').read()

# Random reversed strings to avoid unwanted match + one secure match
words = [
    "zihw","elbadartnu", "retlob", "ssenenif", "nnub", "detartsehcro",
    "elbappirnu", "banehc", "rebmunbus", "gnizilodi", "noituac", "deludehcsnu",
    "/body", "latnanosnocerp", "cihportomeh"
]


def in_test(html, blacklist):
    html_lower = html.lower()
    return any(k in html_lower for k in blacklist)


def search_test(html, pattern):
    if re.search(pattern, html):
        return True
    return False


def match_test(html, pattern):
    html_line = html.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
    if re.match(pattern, html_line):
        return True
    return False


# patternX is word|word|word... patternX_exc is .*(word|word|...)
pattern5 = re.compile("|".join(words[:5]), re.I)
pattern5_exc = re.compile(".*(" + "|".join(words[:5]) + ")", re.I)

pattern10 = re.compile("|".join(words[:10]), re.I)
pattern10_exc = re.compile(".*(" + "|".join(words[:10]) + ")", re.I)

pattern15a = re.compile("|".join(words[:15]), re.I)
pattern15a_exc = re.compile(".*(" + "|".join(words[:15]) + ")", re.I)

words[12] = "doctype"  # A secure match at the beginning of the page
pattern15b = re.compile("|".join(words[:15]), re.I)
pattern15b_exc = re.compile(".*(" + "|".join(words[:15]) + ")", re.I)

words[12] = "featured list"  # A secure match at ~half page
pattern15c = re.compile("|".join(words[:15]), re.I)
pattern15c_exc = re.compile(".*(" + "|".join(words[:15]) + ")", re.I)


in vs re.match vs re.search with no match

print timeit("in_test(html, words[:5])", "from __main__ import *")
print timeit("search_test(html, pattern5)", "from __main__ import *")
print timeit("match_test(html, pattern5_exc)", "from __main__ import *")

0.127397060394
2.05020999908
2.17416286469


print timeit("in_test(html, words[:10])", "from __main__ import *")
print timeit("search_test(html, pattern10)", "from __main__ import *")
print timeit("match_test(html, pattern10_exc)", "from __main__ import *")

0.210324048996
3.73544692993
3.8765540123

These tests don't match any words. in is clearly the winner, and the running time seems to grow linearly with the number of words.


in vs re.match vs re.search with match in different position

print timeit("in_test(html, words[:15])", "from __main__ import *")

# Match at the end
print timeit("search_test(html, pattern15a)", "from __main__ import *")
print timeit("match_test(html, pattern15a_exc)", "from __main__ import *")

# Match at the beginning
print timeit("search_test(html, pattern15b)", "from __main__ import *")
print timeit("match_test(html, pattern15b_exc)", "from __main__ import *")

# Match at ~half page
print timeit("search_test(html, pattern15c)", "from __main__ import *")
print timeit("match_test(html, pattern15c_exc)", "from __main__ import *")

The output is

0.258332967758

5.9074420929
0.0433299541473

0.000770807266235
6.0548210144

2.47815990448
3.25421690941

When a match occurs, the regex can be much faster than in, but it depends on the position of the match: near the beginning re.search is better, near the end re.match is the better choice, and at ~half page both are significantly slower than in.


Regular expressions will help me avoid duplicating words (e.g. è, &egrave;, ...) and let me forget about upper/lowercase (especially with non-ASCII characters). But the speed seems too variable and, on average, slower than in.

Are these tests correct? If so, is there any other built-in method that I can test, or another procedure that would help me in this scenario? The blacklist is going to grow, so I need to take this into account.

Solution

The problem in general

The general problem involves a time-space trade-off:

  • The fastest possible (and most memory-demanding) solution is an N-ary tree, i.e. a trie (where N is the number of letters in the alphabet). Each node has N pointers, each of which is non-null if there is a word in the list with that letter as the next character, plus a flag that is set if a word ends at that node (a minimal sketch follows this list).
  • Another fast implementation with much smaller footprint is the one behind T9 lookup.
  • a hash table (a set in this case, since you're only interested in whether a key is present) has larger overhead (hash calculations, conflict-related operations) but scales extremely well since it has almost constant lookup time in the typical case. Python's mapping-type implementation automatically adjusts the hash table's size to keep the potentially unlimited conflict-related overhead in check.
  • A regex (preferably optimized by minimizing the amount of backtracking) has a negligible footprint but is slow, since Python uses a regex-directed engine that walks the text many times; a text-directed engine like the one in egrep would be better suited for this task. Another factor is that its running time depends heavily on the input (sometimes catastrophically), and it doesn't scale well as the wordlist grows.
  • comparing against a list of words is essentially a primitive kind of text-directed regex engine. It doesn't do backtracking but has larger comparison and list-walking overhead. It may be faster or slower than a regex depending on how these overheads compare.
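As an illustration of the first point, here is a minimal trie sketch (my own code, not from the original answer, written in the same Python 2 style as the question); it trades memory for a scan with no regex backtracking:

def build_trie(blacklist):
    # Each node is a dict mapping a character to a child node;
    # the special key "$" marks the end of a blacklisted word.
    root = {}
    for word in blacklist:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def trie_test(html, root):
    text = html.lower()
    n = len(text)
    for start in xrange(n):              # try to start a match at every offset
        node = root
        for i in xrange(start, n):
            node = node.get(text[i])
            if node is None:
                break
            if "$" in node:              # a blacklisted word ends here
                return True
    return False

Usage would mirror the tests above, e.g. timeit("trie_test(html, trie)", "from __main__ import *") after trie = build_trie(words).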

The specific question about comparing the two methods:

The tests are interpreted correctly - for the material they're performed on. But, as I said, both methods' performance (and hence their relative performance) depends dramatically on the wordlist size, the input size, the input itself, and the optimality of the regex.

Suggested course of action

So you ought to run tests on realistic examples that model your typical use case, i.e.

  • optimize the regex if - and in the same way as - you intend to in production
  • take the average on several documents, where
    • the percentage of those with matches
    • the distribution of match locations
    • the relative occurrence rate of wordlist words

is the same as expected to be in production.

I'd suggest testing the hash table as well: it has a larger initial overhead, but on a large input and/or wordlist it should start outperforming the other two.
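For reference, a rough sketch of such a set-based test, reusing the words list from the question; note that, unlike the substring-based in_test, it matches whole tokens only, and the \w+ tokenization rule is my own assumption:

import re

token_re = re.compile(r"\w+")
blacklist_set = set(w.lower() for w in words)

def set_test(html, blacklist_set):
    # Tokenize the document and look every token up in the set;
    # average lookup cost stays roughly constant as the blacklist grows.
    return any(token.lower() in blacklist_set
               for token in token_re.findall(html))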

To avoid duplicating words, you may want to try the methods with sanitized input (lowercasing, HTML-entity replacement) prior to searching. Again, this is additional overhead that only starts to pay off past a certain scale.
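One possible sanitizing pass, sketched under the assumption that Python 2's HTMLParser module is used to expand entities (so that è and &egrave; compare equal) before lowercasing:

import HTMLParser

_unescape = HTMLParser.HTMLParser().unescape

def sanitize(html):
    # Decode the raw bytes, expand HTML entities, then lowercase, so the
    # blacklist only needs one canonical spelling per word.
    text = html.decode("utf-8", "ignore")
    return _unescape(text).lower()

The blacklist itself would then be stored in the same canonical form.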

Minimizing the test data by using math

If match locations are presumed to be uniformly distributed and wordlist words to have equal occurrence rates, the test data can be simplified to the following three cases (a small generator sketch follows the list):

  1. a text without a match and without many words that start like wordlist words (the "best typical" input)
  2. a text without a match but consisting entirely of words that start the same way as wordlist words, with these "heads" distributed roughly uniformly across the wordlist (the "worst" input for both methods - it shows whether there can be catastrophic failures in production and how much such cases would skew the end result)
  3. a text with a match halfway through, where locating the word in the wordlist requires roughly half the work from both the text and the regex matcher, and without many other words that start like wordlist words
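A hypothetical way to generate these three inputs from the existing words list; the filler vocabulary, the decoy construction and the document size are my own arbitrary choices:

import random

filler = ["lorem", "ipsum", "dolor", "sit", "amet"]   # shares no prefixes with the blacklist
decoys = [w[:-2] + "xx" for w in words]               # start like wordlist words but never match

def make_doc(vocab, size=50000, match=None, match_pos=0.5):
    tokens = [random.choice(vocab) for _ in xrange(size)]
    if match is not None:
        tokens[int(size * match_pos)] = match         # plant the single match
    return " ".join(tokens)

best_typical = make_doc(filler)                       # case 1: no match, no decoys
worst_case   = make_doc(decoys)                       # case 2: no match, all decoys
half_way     = make_doc(filler, match=words[0])       # case 3: a match at ~half the text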

Then, the final "expected average" time:

Txa = (Tn+(Ts-Tn)*Ps)*Pn + (Tm+((Ts-Tm)*Ps)/2)*Pm

where T stands for times and P for expected probabilities; the subscripts mean: n - input without a match, s - (slow) words that start like wordlist ones, m - input with a match.
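A purely illustrative plug-in of the formula, using made-up numbers roughly in the ballpark of the timings above (not measurements from the post):

Tn, Ts, Tm = 0.26, 6.0, 2.5   # times: no-match input, decoy-heavy input, input with a match
Ps = 0.1                      # expected fraction of words that start like wordlist words
Pn, Pm = 0.7, 0.3             # probabilities of no-match / match inputs (Pn + Pm == 1)

Txa = (Tn + (Ts - Tn) * Ps) * Pn + (Tm + ((Ts - Tm) * Ps) / 2) * Pm
print Txa                     # -> ~1.39 for these made-up numbers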
