High performance mass short string search in Python


Problem description

The problem: a large static list of strings is provided as A, and a long string is provided as B. The strings in A are all very short (a keyword list). I want to find which strings in A are substrings of B and collect them.

Now I use a simple loop like:

result = []
for word in A:
    if word in B:
        result.append(word)

But it's crazy slow when A contains ~500,000 or more items.

Is there any library or algorithm that fits this problem? I've tried my best to search but no luck.

Thank you!

Solution

Your problem is large enough that you probably need to hit it with the algorithm bat.

Take a look into the Aho-Corasick Algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.
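To make the idea concrete, here is a minimal pure-Python sketch of Aho-Corasick, not a tuned implementation. The function names build_automaton and find_keywords are made up for this illustration. It assumes A is a list of strings and B is a single string, and it skips empty keywords (which the original loop would always report as found). The automaton is built once from A, then B is scanned in a single pass, so the scan cost no longer grows with the number of keywords.

```python
from collections import deque


def build_automaton(keywords):
    """Build trie transitions, failure links, and output sets for the keywords."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state for the longest proper suffix in the trie
    out = [set()]  # out[state] -> keywords that end at this state

    # Phase 1: insert every keyword into the trie.
    for word in keywords:
        if not word:
            continue  # assumption: empty keywords are ignored
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append(set())
            state = nxt
        out[state].add(word)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fail back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Keywords ending at the suffix state also end here.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_keywords(keywords, text):
    """Return the keywords that occur in text, in their original order."""
    goto, fail, out = build_automaton(keywords)
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return [word for word in keywords if word in found]


if __name__ == "__main__":
    A = ["he", "she", "his", "hers", "cat"]
    B = "ushers"
    print(find_keywords(A, B))
```

Since A is static, the automaton could be built once and reused for many different B strings. With 500,000 keywords, a C-backed implementation is likely to be considerably faster and leaner than this list-of-dicts version.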

Also, look into the work by Nicholas Lehuen with his PyTST package.

There are also references in a related Stack Overflow question that mention other algorithms, such as Rabin-Karp: "Algorithm for linear pattern matching?"
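For comparison, a multi-pattern Rabin-Karp variant could look roughly like the sketch below. This is an illustration under assumptions, not code from the linked question: the function name rabin_karp_multi and the base and mod values are arbitrary choices. Keywords are grouped by length, and for each distinct length a rolling hash is slid over B. Every hash hit is confirmed by comparing the actual substring, so collisions cannot produce false matches.

```python
def rabin_karp_multi(keywords, text, base=257, mod=(1 << 61) - 1):
    """Return the keywords that occur in text, in their original order."""
    # Group non-empty keywords by length; each length gets its own pass.
    by_length = {}
    for word in keywords:
        if word:
            by_length.setdefault(len(word), set()).add(word)

    def poly_hash(s):
        # Polynomial hash of s, most significant character first.
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    found = set()
    for m, words in by_length.items():
        if m > len(text):
            continue
        # Map each hash value to the keywords that produce it.
        targets = {}
        for w in words:
            targets.setdefault(poly_hash(w), set()).add(w)

        high = pow(base, m - 1, mod)  # weight of the window's leading character
        h = poly_hash(text[:m])
        for i in range(len(text) - m + 1):
            if h in targets:
                window = text[i:i + m]
                if window in targets[h]:  # confirm to rule out collisions
                    found.add(window)
            if i + m < len(text):
                # Roll the window: drop text[i], append text[i + m].
                h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return [word for word in keywords if word in found]


if __name__ == "__main__":
    print(rabin_karp_multi(["he", "she", "his", "hers", "cat"], "ushers"))
```

The cost here grows with the number of distinct keyword lengths, since B is scanned once per length. For a short-keyword list with only a handful of lengths that may be acceptable, but Aho-Corasick scans B once regardless.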
