What's the most efficient way to find one of several substrings in Python?


Problem description


I have a list of possible substrings, e.g. ['cat', 'fish', 'dog']. In practice, the list contains hundreds of entries.

I'm processing a string, and I need to find the index of the first occurrence of any of these substrings.

To clarify, for '012cat' the result is 3, and for '0123dog789cat' the result is 4.

I also need to know which substring was found (e.g. its index in the substring list or the text itself), or at least the length of the substring matched.

There are obvious brute-force ways to achieve this; I wondered if there's an elegant Python/regex solution.

Solution

I would assume a regex is better than checking for each substring individually because conceptually the regular expression is modeled as a DFA, and so as the input is consumed all matches are being tested for at the same time (resulting in one scan of the input string).

So, here is an example:

import re

def work():
  to_find = re.compile("cat|fish|dog")
  search_str = "blah fish cat dog haha"
  match_obj = to_find.search(search_str)
  the_index = match_obj.start()  # produces 5, the index of fish
  which_word_matched = match_obj.group()  # "fish"
  # Note, if no match, match_obj is None
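
For contrast, the "checking each substring individually" baseline mentioned above might look like the following minimal sketch (this helper is illustrative and not part of the original answer):

def find_first_bruteforce(haystack, needles):
    # Check each needle separately and keep the earliest hit.
    best_index, best_needle = None, None
    for needle in needles:
        i = haystack.find(needle)
        if i != -1 and (best_index is None or i < best_index):
            best_index, best_needle = i, needle
    return best_index, best_needle

# find_first_bruteforce('0123dog789cat', ['cat', 'fish', 'dog']) -> (4, 'dog')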

UPDATE: Some care should be taken when combining words into a single pattern of alternatives. The following code builds the regex, but escapes any regex special characters and sorts the words so that longer words get a chance to match before any of their shorter prefixes:

def wordlist_to_regex(words):
    escaped = map(re.escape, words)
    combined = '|'.join(sorted(escaped, key=len, reverse=True))
    return re.compile(combined)

>>> r.search('smash atomic particles').span()
(6, 10)
>>> r.search('visit usenet:comp.lang.python today').span()
(13, 29)
>>> r.search('a north\south division').span()
(2, 13)
>>> r.search('012cat').span()
(3, 6)
>>> r.search('0123dog789cat').span()
(4, 7)
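
Note that the interactive session above never shows how r was created; presumably it was built with wordlist_to_regex on a word list along these lines (this reconstruction is only an assumption inferred from the spans shown, not part of the original answer):

# Hypothetical word list, inferred from the match spans above.
r = wordlist_to_regex(['cat', 'dog', 'atom', r'north\south', 'comp.lang.python'])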

END UPDATE

It should be noted that you will want to form the regex (i.e., call re.compile()) as rarely as possible. The best case is that you know your searches ahead of time (or compute them once or infrequently) and save the result of re.compile somewhere. My example is just a simple nonsense function so you can see the usage of the regex. There is more regex documentation here:

http://docs.python.org/library/re.html

Hope this helps.
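
To make the compile-once advice concrete, here is a minimal sketch (WORDS, PATTERN, and first_match are illustrative names, not from the original answer; wordlist_to_regex is the helper defined above):

WORDS = ['cat', 'fish', 'dog']       # known ahead of time (or computed once)
PATTERN = wordlist_to_regex(WORDS)   # compiled a single time, then reused

def first_match(text):
    # Returns (index, matched_word), or None if nothing matches.
    m = PATTERN.search(text)
    return (m.start(), m.group()) if m is not None else None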

UPDATE: I am unsure how Python implements regular expressions, but to answer Rax's question about whether there are limits on re.compile() (for example, how many words you can "|" together to match at once) and how long compilation takes: neither of these seems to be an issue. I tried out the code below, which is good enough to convince me. (I could have made this better by adding timing and reporting results, as well as throwing the list of words into a set to ensure there are no duplicates... but both of those improvements seem like overkill.) This code ran basically instantaneously, and convinced me that I am able to search for 2000 words (of size 10), and that any of them will match appropriately. Here is the code:

import random
import re
import string
import sys

def main(args):
    # Build 2000 random 10-character words from letters and digits.
    words = []
    letters_and_digits = "%s%s" % (string.letters, string.digits)
    for i in range(2000):
        chars = []
        for j in range(10):
            chars.append(random.choice(letters_and_digits))
        words.append(("%s"*10) % tuple(chars))
    # Combine all 2000 words into a single alternation pattern and compile it.
    search_for = re.compile("|".join(words))
    # Search a string containing the last, middle, and first words; the last
    # word appears at index 0, so it should be the match that is reported.
    first, middle, last = words[0], words[len(words) / 2], words[-1]
    search_string = "%s, %s, %s" % (last, middle, first)
    match_obj = search_for.search(search_string)
    if match_obj is None:
        print "Ahhhg"
        return
    index = match_obj.start()
    which = match_obj.group()
    if index != 0:
        print "ahhhg"
        return
    if words[-1] != which:
        print "ahhg"
        return

    print "success!!! Generated 2000 random words, compiled re, and was able to perform matches."

if __name__ == "__main__":
    main(sys.argv)
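
Note that the test above is written for Python 2 (print statements, string.letters, and integer division with /). For anyone rerunning it today, a rough Python 3 equivalent might look like this sketch (same idea, not part of the original answer):

import random
import re
import string

def main():
    letters_and_digits = string.ascii_letters + string.digits
    # Build 2000 random 10-character words from letters and digits.
    words = ["".join(random.choice(letters_and_digits) for _ in range(10))
             for _ in range(2000)]
    # Combine all 2000 words into a single alternation and compile it.
    search_for = re.compile("|".join(words))
    first, middle, last = words[0], words[len(words) // 2], words[-1]
    search_string = "%s, %s, %s" % (last, middle, first)
    match_obj = search_for.search(search_string)
    assert match_obj is not None
    assert match_obj.start() == 0          # the last word leads the string
    assert match_obj.group() == words[-1]  # and is the word reported
    print("success!!! Generated 2000 random words, compiled re, "
          "and was able to perform matches.")

if __name__ == "__main__":
    main()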

UPDATE: It should be noted that the order of things ORed together in the regex matters. Have a look at the following test inspired by TZOTZIOY:

>>> search_str = "01catdog"
>>> test1 = re.compile("cat|catdog")
>>> match1 = test1.search(search_str)
>>> match1.group()
'cat'
>>> match1.start()
2
>>> test2 = re.compile("catdog|cat")  # reverse order
>>> match2 = test2.search(search_str)
>>> match2.group()
'catdog'
>>> match2.start()
2

This suggests the order matters :-/. I am not sure what this means for Rax's application, but at least the behavior is known.
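
Note that the wordlist_to_regex helper from the earlier update already guards against this: it sorts longer alternatives first, so 'catdog' is tried before 'cat' regardless of the order the words were supplied in. A quick check, assuming that helper:

>>> wordlist_to_regex(["cat", "catdog"]).search("01catdog").group()
'catdog'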

UPDATE: I posted this question about the implementation of regular expressions in Python, which will hopefully give us some insight into the issues found here.
