Most efficient way to find partial string matches in a large file of strings (Python)

Question

I downloaded the Wikipedia article titles file, which contains the name of every Wikipedia article. I need to search it for all article titles that might match a given term. For example, I might have the word "hockey", but the Wikipedia article I want is "Ice_hockey". The search should also be case-insensitive.

I'm using Python. Is there a more efficient way than a line-by-line search? Ideally I'll be running this search 500 to 1000 times per minute. If line by line is my only option, are there optimizations I can make within it?

I think the file has several million lines.

Any ideas?

Thanks.

Recommended answer

Greg's answer is good if you want to match on individual words. If you want to match on substrings, you'll need something a bit more complicated, such as a suffix tree (http://en.wikipedia.org/wiki/Suffix_tree). Once constructed, a suffix tree can efficiently answer queries for arbitrary substrings, so in your example it could match "Ice_Hockey" when someone searches for "hock".
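As a rough illustration of the idea, here is a minimal sketch that uses a sorted list of suffixes (a simplified suffix array) rather than a true suffix tree. The function names `build_suffix_index` and `search` are hypothetical, and the titles are assumed to fit in memory. Every suffix of every lower-cased title is stored once, and a binary search then finds all suffixes that begin with the query, which is the same as finding every title that contains it.

```python
import bisect


def build_suffix_index(titles):
    """Build a sorted list of (suffix, title_index) pairs.

    Hypothetical helper: this is a simplified suffix array, not a real
    suffix tree. It stores every suffix of every lower-cased title, so
    memory grows with the square of each title's length.
    """
    entries = []
    for index, title in enumerate(titles):
        lowered = title.lower()  # lower-case once for case-insensitive search
        for start in range(len(lowered)):
            entries.append((lowered[start:], index))
    entries.sort()
    return entries


def search(entries, titles, query):
    """Return every title that contains `query`, ignoring case."""
    needle = query.lower()
    # (needle,) sorts before any (needle..., index) pair, so this finds
    # the first suffix that is >= the query.
    position = bisect.bisect_left(entries, (needle,))
    hits = set()
    # Suffixes sharing a prefix are contiguous in sorted order.
    while position < len(entries) and entries[position][0].startswith(needle):
        hits.add(entries[position][1])
        position += 1
    return [titles[i] for i in sorted(hits)]


if __name__ == "__main__":
    sample = ["Ice_hockey", "Hockey_stick", "Field_Hockey", "Football"]
    index = build_suffix_index(sample)
    print(search(index, sample, "hock"))
```

The index is built once and reused for every query, so each lookup costs a binary search plus the number of matching suffixes instead of a full scan of the file. For several million titles the pure-Python list above may use a lot of memory, so treat it as a way to see the technique rather than a tuned implementation.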
