在包含一些通配符的大列表中进行成员资格测试 [英] Membership testing in a large list that has some wildcards

查看:30
本文介绍了在包含一些通配符的大列表中进行成员资格测试的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当该列表包含特殊类别时,如何测试该短语是否在大型 (650k) 短语列表中?

How do I test whether a phrase is in a large (650k) list of phrases when that list includes special categories?

例如,我想测试短语 ["he", "had", "the", "nerve"] 是否在列表中.确实如此,但是在 ["he", "had", "!DETERMINER", "nerve"] 下,其中 "!DETERMINER" 是包含几种选择(a, an, the).我有大约 350 个词类,其中一些很长,所以我认为枚举列表中具有一个(或多个)词类的每个项目是不可行的.

For instance, I want to test if the phrase ["he", "had", "the", "nerve"] is in the list. It is, but under ["he", "had", "!DETERMINER", "nerve"] where "!DETERMINER" is the name of a wordclass that contains several choices (a, an, the). I have about 350 wordclasses and some of them are quite lengthy, so I don't think it would be feasible to enumerate each item in the list that has one (or more) wordclasses.

我想使用一组这样的短语,而不是慢慢地浏览一个列表,但我不知道如何处理词类的可变性.速度非常重要,因为我每次需要进行数十万次比较.

I would like to use a set of these phrases instead of slowly working my way through a list, but I don't know how to deal with the variability of the wordclasses. Speed is pretty important, since I need to make this comparison hundreds of thousands of times per go.

推荐答案

类似于 pjwerneck 的建议,你可以使用一棵树(或者更具体地说是一个 trie) 将列表分部分存储,但对其进行扩展以专门处理类别.

Similar to pjwerneck's suggestion, you could use a tree (or more specifically a trie) to store the lists in parts, but extend it to treat the categories specially.

# phrase_trie.py

from collections import defaultdict

CATEGORIES = {"!DETERMINER": set(["a","an","the"]),
              "!VERB": set(["walked","talked","had"])}

def get_category(word):
    for name,words in CATEGORIES.items():
        if word in words:
            return name
    return None

class PhraseTrie(object):
    def __init__(self):
        self.children = defaultdict(PhraseTrie)
        self.categories = defaultdict(PhraseTrie)

    def insert(self, phrase):
        if not phrase: # nothing to insert
            return

        this=phrase[0]
        rest=phrase[1:]

        if this in CATEGORIES: # it's a category name
            self.categories[this].insert(rest)
        else:
            self.children[this].insert(rest)

    def contains(self, phrase):
        if not phrase:
            return True # the empty phrase is in everything

        this=phrase[0]
        rest=phrase[1:]

        test = False

        # the `if not test` are because if the phrase satisfies one of the
        # previous tests we don't need to bother searching more

        # allow search for ["!DETERMINER", "cat"]
        if this in self.categories: 
            test = self.categories[this].contains(rest)

        # the word is literally contained
        if not test and this in self.children:
            test = self.children[this].contains(rest)

        if not test:
            # check for the word being in a category class like "a" in
            # "!DETERMINER"
            cat = get_category(this)
            if cat in self.categories:
                test = self.categories[cat].contains(rest)
        return test

    def __str__(self):
        return '(%s,%s)' % (dict(self.children), dict(self.categories))
    def __repr__(self):
        return str(self)

if __name__ == '__main__':
    words = PhraseTrie()
    words.insert(["he", "had", "!DETERMINER", "nerve"])
    words.insert(["he", "had", "the", "evren"])
    words.insert(["she", "!VERB", "the", "nerve"])
    words.insert(["no","categories","here"])

    for phrase in ("he had the nerve",
                   "he had the evren",
                   "she had the nerve",
                   "no categories here",
                   "he didn't have the nerve",
                   "she had the nerve more"):
        print '%25s =>' % phrase, words.contains(phrase.split())

运行python phrase_trie.py:

         he had the nerve => True
         he had the evren => True
        she had the nerve => True
       no categories here => True
 he didn't have the nerve => False
   she had the nerve more => False

关于代码的一些要点:

  • defaultdict 的使用是为了避免在调用insert 之前必须检查该子树是否存在;它会在需要时自动创建和初始化.
  • 如果要对 get_category 进行大量调用,那么构建一个反向查找字典以提高速度可能是值得的.(或者,更好的是,记住对 get_category 的调用,这样常用词可以快速查找,但不会浪费内存存储您从未查找过的词.)
  • 代码假设每个词只属于一个类别.(如果没有,唯一的变化是 get_category 返回一个列表和 PhraseTrie 的相关部分循环遍历这个列表.)
  • The use of defaultdict is to avoid having to check if that sub-trie exists before calling insert; it is automatically created and initialised when needed.
  • If there are going to be a lot of calls to get_category, it might be worth constructing a reverse look-up dictionary for speed. (Or, even better, memoise the calls to get_category so that common words have fast look-ups but you don't waste the memory storing words you never look up.)
  • The code assumes that each word is in only one category. (If not, the only changes are get_category returning a list and the relevant section of PhraseTrie looping through this list.)

这篇关于在包含一些通配符的大列表中进行成员资格测试的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆