如何在 Python 中创建一个尝试 [英] How to create a trie in Python

查看:55
本文介绍了如何在 Python 中创建一个尝试的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对尝试和 DAWG(直接非循环词图)感兴趣,我已经阅读了很多关于它们的内容,但我不明白输出的 trie 或 DAWG 文件应该是什么样子.

  • trie 应该是嵌套字典的对象吗?每个字母在哪里分成字母等等?
  • 如果有 100k 或 500k 条目,在这样的字典上执行查找会很快吗?
  • 如何实现由多个单词组成的单词块,以 - 或空格分隔?
  • 如何将单词的前缀或后缀链接到结构中的另一部分?(对于 DAWG)

我想了解最好的输出结构,以便弄清楚如何创建和使用.

我也很欣赏 DAWG 的输出trie.

我不想看到气泡相互链接的图形表示,我想知道一旦一组单词变成尝试或 DAWG 的输出对象.

解决方案

Unwind 本质上是正确的,有很多不同实现 trie 的方法;对于一个大型的、可扩展的特里,嵌套字典可能会变得很麻烦——或者至少空间效率低下.但既然你刚刚开始,我认为这是最简单的方法;你可以用几行代码编写一个简单的trie.首先,构造trie的函数:

<预><代码>>>>_end = '_end_'>>>>>>def make_trie(*words):... root = dict()...一字一句:... current_dict = 根... 对于字中的字母:... current_dict = current_dict.setdefault(letter, {})... current_dict[_end] = _end...返回根...>>>make_trie('foo', 'bar', 'baz', 'barz'){'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}},'z': {'_end_': '_end_'}}},'f': {'o': {'o': {'_end_': '_end_'}}}}

如果您不熟悉 setdefault,它只是在字典中查找一个键(这里是 letter_end).如果键存在,则返回关联的值;如果没有,它会为该键分配一个默认值并返回该值({}_end).(这就像 get 的一个版本还会更新字典.)

接下来是一个测试单词是否在trie中的函数:

<预><代码>>>>def in_trie(trie, word):... current_dict = 尝试... 对于字中的字母:...如果字母不在 current_dict 中:...返回假... current_dict = current_dict[字母]...在current_dict中返回_end...>>>in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'baz')真的>>>in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barz')真的>>>in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barzz')错误的>>>in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'bart')错误的>>>in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'ba')错误的

我将插入和移除作为练习留给您.

当然,Unwind 的建议不会更难.可能存在轻微的速度劣势,因为找到正确的子节点需要线性搜索.但是搜索将限于可能的字符数——如果我们包含 _end,则为 27.此外,按照他的建议,通过创建大量节点并按索引访问它们没有任何好处;你也可以嵌套列表.

最后,我要补充一点,创建有向无环词图 (DAWG) 会稍微复杂一些,因为您必须检测当前词与结构中的另一个词共享后缀的情况.事实上,这可能会变得相当复杂,这取决于您希望如何构建 DAWG!你可能需要学习一些关于 Levenshtein 距离 以使其正确.

I'm interested in tries and DAWGs (direct acyclic word graph) and I've been reading a lot about them but I don't understand what should the output trie or DAWG file look like.

  • Should a trie be an object of nested dictionaries? Where each letter is divided in to letters and so on?
  • Would a lookup performed on such a dictionary be fast if there are 100k or 500k entries?
  • How to implement word-blocks consisting of more than one word separated with - or space?
  • How to link prefix or suffix of a word to another part in the structure? (for DAWG)

I want to understand the best output structure in order to figure out how to create and use one.

I would also appreciate what should be the output of a DAWG along with trie.

I do not want to see graphical representations with bubbles linked to each other, I want to know the output object once a set of words are turned into tries or DAWGs.

解决方案

Unwind is essentially correct that there are many different ways to implement a trie; and for a large, scalable trie, nested dictionaries might become cumbersome -- or at least space inefficient. But since you're just getting started, I think that's the easiest approach; you could code up a simple trie in just a few lines. First, a function to construct the trie:

>>> _end = '_end_'
>>> 
>>> def make_trie(*words):
...     root = dict()
...     for word in words:
...         current_dict = root
...         for letter in word:
...             current_dict = current_dict.setdefault(letter, {})
...         current_dict[_end] = _end
...     return root
... 
>>> make_trie('foo', 'bar', 'baz', 'barz')
{'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}}, 
             'z': {'_end_': '_end_'}}}, 
 'f': {'o': {'o': {'_end_': '_end_'}}}}

If you're not familiar with setdefault, it simply looks up a key in the dictionary (here, letter or _end). If the key is present, it returns the associated value; if not, it assigns a default value to that key and returns the value ({} or _end). (It's like a version of get that also updates the dictionary.)

Next, a function to test whether the word is in the trie:

>>> def in_trie(trie, word):
...     current_dict = trie
...     for letter in word:
...         if letter not in current_dict:
...             return False
...         current_dict = current_dict[letter]
...     return _end in current_dict
... 
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'baz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barzz')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'bart')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'ba')
False

I'll leave insertion and removal to you as an exercise.

Of course, Unwind's suggestion wouldn't be much harder. There might be a slight speed disadvantage in that finding the correct sub-node would require a linear search. But the search would be limited to the number of possible characters -- 27 if we include _end. Also, there's nothing to be gained by creating a massive list of nodes and accessing them by index as he suggests; you might as well just nest the lists.

Finally, I'll add that creating a directed acyclic word graph (DAWG) would be a bit more complex, because you have to detect situations in which your current word shares a suffix with another word in the structure. In fact, this can get rather complex, depending on how you want to structure the DAWG! You may have to learn some stuff about Levenshtein distance to get it right.

这篇关于如何在 Python 中创建一个尝试的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆