Python:在大型字符串语料库中查找部分字符串匹配 [英] Python: Finding partial string matches in a large corpus of strings

查看:72
本文介绍了Python:在大型字符串语料库中查找部分字符串匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对在 Python 中实现自动完成感兴趣.例如,当用户输入字符串时,我想显示磁盘上名称以该字符串开头的文件子集.

I'm interested in implementing autocomplete in Python. For example, as the user types in a string, I'd like to show the subset of files on disk whose names start with that string.

在大型语料库(例如几十万个字符串)中查找匹配某些条件的字符串的有效算法是什么?类似的东西:

What's an efficient algorithm for finding strings that match some condition in a large corpus (say a few hundred thousand strings)? Something like:

matches = [s for s in allfiles if s.startswith(input)]

我希望条件是灵活的;例如.而不是严格的开始,只要输入中的所有字母以相同的顺序出现在 s 中,它就会是一个匹配.有什么比我在这里展示的蛮力方法更好的?

I'd like to have the condition be flexible; eg. instead of a strict startswith, it'd be a match so long as all letters in input appears in s in the same order. What's better than the brute-force method I'm showing here?

推荐答案

对于精确匹配,通常实现这样的事情的方法是将您的语料库存储在 trie.这个想法是将每个字母存储为树中的一个节点,链接到单词中的下一个字母.查找匹配项只是在树上行走,并显示您当前位置的所有子项.例如.cat"、cow"和car"将存储为:

For exact matching, generally the way to implement something like this is to store your corpus in a trie. The idea is that you store each letter as a node in the tree, linking to the next letter in a word. Finding the matches is simply walking the tree, and showing all children of your current location. eg. "cat", "cow" and "car" would be stored as:

  a--t
 / \ 
c   r
 \
  o--w

当你得到一个 c 时,你从 c 节点开始,然后一个 a 会带你到 c/a 节点(子节点"t" 和 "r",将 cat 和 car 作为你的补全.

When you get a c, you start at the c node, an a will then take you to the c/a node (children "t" and "r", making cat and car as your completions).

请注意,您还需要标记完整单词的节点来处理作为其他人的子字符串的名称(例如car"和cart")

Note that you'll also need to mark nodes that are complete words to handle names that are substrings of others (eg "car" and "cart")

要获得所需的模糊匹配,您可能需要进行一些更改.

To get the desired fuzzy matching, you may need to make some changes however.

这篇关于Python:在大型字符串语料库中查找部分字符串匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆