Performance - Fastest way to compare 2 large lists of strings in Python


Question

I have two Python lists, one of which contains about 13000 disallowed phrases, and one which contains about 10000 sentences.

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

I need to check every sentence in the sentences list to see whether it contains any phrase from the phrases list; if it does, I want to put ** around the phrase and add the sentence to another list. I also need to do this in the fastest possible way.

This is what I have so far:

import re

newlist = []  # sentences with a disallowed phrase highlighted
for sentence in sentences:
    for phrase in phrases:
        if phrase in sentence.lower():
            iphrase = re.compile(re.escape(phrase), re.IGNORECASE)
            newsentence = iphrase.sub("**" + phrase + "**", sentence)
            newlist.append(newsentence)

So far this approach takes about 60 seconds to complete.

I tried using multiprocessing (each sentence's for loop was mapped separately), however this yielded even slower results. Given that each process was running at about 6% CPU usage, it appears the overhead makes mapping such a small task to multiple cores not worth it. I thought about separating the sentences list into smaller chunks and mapping those to separate processes, but haven't quite figured out how to implement this.
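The chunking idea above can be sketched with `Pool.map`'s `chunksize` argument, which batches items per worker so each process gets a slab of sentences rather than one at a time. This is a minimal illustration, not the asker's code; the small `phrases`/`sentences` lists stand in for the real 13000/10000-item data:

```python
import re
from multiprocessing import Pool

# Stand-in data for the real lists
phrases = ["phrase with spaces", "some sentences"]
sentences = ["some sentences are longer", "no match here"] * 4

# Compile one alternation up front; IGNORECASE mirrors the original loop
rx = re.compile("({})".format("|".join(map(re.escape, phrases))), re.IGNORECASE)

def mark(sentence):
    # Wrap any matched phrase in ** markers
    return rx.sub(r"**\1**", sentence)

if __name__ == "__main__":
    with Pool() as pool:
        # chunksize batches sentences per worker, amortising the
        # inter-process overhead that made per-sentence mapping slow
        marked = pool.map(mark, sentences, chunksize=100)
    results = [s for s in marked if "**" in s]
    print(results)
```

Whether this beats a single-process loop still depends on how expensive each substitution is relative to the pickling overhead.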

I've also considered using a binary search algorithm but haven't been able to figure out how to use this with strings.

So essentially, what would be the fastest possible way to perform this check?

Answer

Build your regex once, sorting the phrases longest first so the `**`s wrap the longest matching phrase rather than the shortest; then perform the substitution and filter out the sentences where no substitution was made, e.g.:

phrases = [
    "phrase1",
    "phrase2",
    "phrase with spaces",
    "can be really really",
    "characters",
    "some sentences",
    # ...
]

sentences = [
    "sentence",
    "some sentences are longer",
    "some sentences can be really really ... really long, about 1000 characters.",
    # ...
]

import re

# Build the regex string required, longest phrases first
rx = '({})'.format('|'.join(re.escape(el) for el in sorted(phrases, key=len, reverse=True)))
# Generator to yield replaced sentences
it = (re.sub(rx, r'**\1**', sentence) for sentence in sentences)
# Build list of paired new and old sentences, keeping only those that changed
results = [new_sentence for old_sentence, new_sentence in zip(sentences, it) if old_sentence != new_sentence]

Giving you a `results` of:

['**some sentences** are longer',
 '**some sentences** **can be really really** ... really long, about 1000 **characters**.']
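Note that the alternation above matches case-sensitively, while the original loop matched with `re.IGNORECASE`. If that behaviour is needed, the same pattern can be compiled once with the flag; this is a small variation on the answer's code with made-up sample data:

```python
import re

# Hypothetical sample data
phrases = ["Phrase1", "some sentences"]
sentences = ["Some Sentences are longer", "nothing matches"]

# Compile once, longest phrases first, matching case-insensitively
rx = re.compile(
    "({})".format("|".join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))),
    re.IGNORECASE,
)

# Keep only sentences that were actually changed by the substitution
results = [new for old, new in ((s, rx.sub(r"**\1**", s)) for s in sentences)
           if old != new]
print(results)  # ['**Some Sentences** are longer']
```

Because `\1` substitutes the matched text itself, the original capitalisation of the sentence is preserved inside the `**` markers.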
