实现Patricia Trie用作字典 [英] Implementing a Patricia Trie for use as a dictionary

查看:143
本文介绍了实现Patricia Trie用作字典的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用方法 addWord() isWord()来实现Patricia Trie,以及 isPrefix()作为存储大型单词词典以快速检索(包括前缀搜索)的方法。我已经阅读了这些概念,但他们只是没有澄清实现。我想知道(在Java或Python代码中)如何实现Trie,特别是节点(或者我应该递归地实现它)。我看到一个人使用26个子节点的数组设置为null / None来实现它。是否有更好的策略(例如将字母视为位)以及如何实现它?

I'm attempting to implement a Patricia Trie with the methods addWord(), isWord(), and isPrefix() as a means to store a large dictionary of words for quick retrieval (including prefix search). I've read up on the concepts but they just aren't clarifying into an implementation. I want to know (in Java or Python code) how to implement the Trie, particularly the nodes (or should I implement it recursively). I saw one person who implemented it with an array of 26 child nodes set to null/None. Is there a better strategy (such as treating the letters as bits) and how would you implement it?

推荐答案

有人问过关于帕特里夏尝试的问题不久前,我想过制作一个Python实现,但这次我决定实际给它一个机会(是的,这是过分的,但它似乎是一个不错的小项目)。我所做的可能不是纯粹的Patricia trie实现,但我更喜欢我的方式。其他Patricia尝试(在其他语言中)只为孩子使用一个列表并检查每个孩子是否有匹配,但我认为这是相当低效的,所以我使用词典。基本上我是如何设置的:

Someone else asked a question about Patricia tries a while ago and I thought about making a Python implementation then, but this time I decided to actually give it a shot (Yes, this is way overboard, but it seemed like a nice little project). What I have made is perhaps not a pure Patricia trie implementation, but I like my way better. Other Patricia tries (in other languages) use just a list for the children and check each child to see there is a match, but I thought this was rather inefficient so I use dictionaries. Here is basically how I've set it up:

我将从根节点开始。根只是一本字典。字典具有通向分支的所有单个字符(单词的第一个字母)的键。与每个键对应的值是列表,其中第一项是字符串,其给出与trie的该分支匹配的字符串的其余部分,第二项是从该节点引出进一步分支的字典。这个字典也有单个字符键,对应于单词其余部分的第一个字母,然后进程沿着trie继续。

I'll start at the root node. The root is just a dictionary. The dictionary has keys that are all single characters (the first letters of words) leading to branches. The values corresponding with each key are lists where the first item is a string which gives the rest of the string that matches with this branch of the trie, and the second item is a dictionary leading to further branches from this node. This dictionary also has single character keys that correspond with the first letter of the rest of the word and the process continues down the trie.

另一件事我应该提到的是,如果给定节点有分支,但也是trie本身的一个单词,然后通过在字典中有一个''键表示,该键导致带有列表的节点 ['',{}]

Another thing I should mention is that if a given node has branches, but also is a word in the trie itself, then that is denoted by having a '' key in the dictionary that leads to a node with the list ['',{}].

这是一个显示单词存储方式的小例子(根节点)是变量 _d ):

Here's a small example that shows how words are stored (the root node is the variable _d):

>>> x = patricia()
>>> x.addWord('abcabc')
>>> x._d
{'a': ['bcabc', {}]}
>>> x.addWord('abcdef')
>>> x._d
{'a': ['bc', {'a': ['bc', {}], 'd': ['ef', {}]}]}
>>> x.addWord('abc')
{'a': ['bc', {'a': ['bc', {}], '': ['', {}], 'd': ['ef', {}]}]}

请注意,在最后一种情况下,'' '密钥被添加到词典中以表示'abc'是除了'abcdef'和'abcabc'之外的单词。

Notice that in the last case, a '' key was added to the dictionary to denote that 'abc' is a word in a addition to 'abcdef' and 'abcabc'.

源代码

class patricia():
    def __init__(self):
        self._data = {}

    def addWord(self, word):
        data = self._data
        i = 0
        while 1:
            try:
                node = data[word[i:i+1]]
            except KeyError:
                if data:
                    data[word[i:i+1]] = [word[i+1:],{}]
                else:
                    if word[i:i+1] == '':
                        return
                    else:
                        if i != 0:
                            data[''] = ['',{}]
                        data[word[i:i+1]] = [word[i+1:],{}]
                return

            i += 1
            if word.startswith(node[0],i):
                if len(word[i:]) == len(node[0]):
                    if node[1]:
                        try:
                            node[1]['']
                        except KeyError:
                            data = node[1]
                            data[''] = ['',{}]
                    return
                else:
                    i += len(node[0])
                    data = node[1]
            else:
                ii = i
                j = 0
                while ii != len(word) and j != len(node[0]) and \
                      word[ii:ii+1] == node[0][j:j+1]:
                    ii += 1
                    j += 1
                tmpdata = {}
                tmpdata[node[0][j:j+1]] = [node[0][j+1:],node[1]]
                tmpdata[word[ii:ii+1]] = [word[ii+1:],{}]
                data[word[i-1:i]] = [node[0][:j],tmpdata]
                return

    def isWord(self,word):
        data = self._data
        i = 0
        while 1:
            try:
                node = data[word[i:i+1]]
            except KeyError:
                return False
            i += 1
            if word.startswith(node[0],i):
                if len(word[i:]) == len(node[0]):
                    if node[1]:
                        try:
                            node[1]['']
                        except KeyError:
                            return False
                    return True
                else:
                    i += len(node[0])
                    data = node[1]
            else:
                return False

    def isPrefix(self,word):
        data = self._data
        i = 0
        wordlen = len(word)
        while 1:
            try:
                node = data[word[i:i+1]]
            except KeyError:
                return False
            i += 1
            if word.startswith(node[0][:wordlen-i],i):
                if wordlen - i > len(node[0]):
                    i += len(node[0])
                    data = node[1]
                else:
                    return True
            else:
                return False

    def removeWord(self,word):
        data = self._data
        i = 0
        while 1:
            try:
                node = data[word[i:i+1]]
            except KeyError:
                print "Word is not in trie."
                return
            i += 1
            if word.startswith(node[0],i):
                if len(word[i:]) == len(node[0]):
                    if node[1]:
                        try:
                            node[1]['']
                            node[1].pop('')
                        except KeyError:
                            print "Word is not in trie."
                        return
                    data.pop(word[i-1:i])
                    return
                else:
                    i += len(node[0])
                    data = node[1]
            else:
                print "Word is not in trie."
                return


    __getitem__ = isWord

你可能已经注意到,最后我将 __ getitem __ 设置为isWord方法。这意味着

You may have noticed that at the end I set __getitem__ to the isWord method. This means that

x['abc']

将返回trie中是否为'abc'。

will return whether 'abc' in the trie or not.

我想也许我应该制作一个模块并将其提交给PyPI,但它需要更多测试,至少需要一个removeWord方法。如果您发现任何错误让我知道,但它似乎工作得很好。此外,如果你看到效率有任何重大改进,我也想听听它们。我考虑过在每个分支的底部都有空字典,但我现在就离开了。这些空字典可以替换为链接到单词的数据,以扩展实现的用途。

I think that maybe I should make a module out of this and submit it to PyPI, but it needs more testing and at least a removeWord method. If you find any bugs let me know, but it seems to be working pretty well. Also, if you see any big improvements in efficiency I would also like to hear about them. I've considered doing something about having empty dictionaries at the bottom of each branch, but I'm leaving it for now. These empty dictionaries may be replaced with data linked to the word to expand the uses of the implementation for instance.

无论如何,如果你不喜欢我实现它的方式,至少也许这会给你一些关于如何实现自己版本的想法。

Anyway, if you don't like the way I implemented it, at least maybe this will give you some ideas about how you would like to implement your own version.

这篇关于实现Patricia Trie用作字典的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆