Longest Prefix Matches for URLs


Problem description

I need information about any standard Python package that can be used for "longest prefix match" on URLs. I have gone through two packages, http://packages.python.org/PyTrie/#pytrie.StringTrie and http://pypi.python.org/pypi/trie/0.1.1, but they don't seem to be useful for the longest-prefix-match task on URLs.

For example, if my set has these URLs: 1 -> http://www.google.com/mail, 2 -> http://www.google.com/document, 3 -> http://www.facebook.com, etc.

Now, if I search for 'http://www.google.com/doc' it should return 2, and a search for 'http://www.face' should return 3.

I want to confirm whether there is any standard Python package that can help me do this, or whether I should implement a trie for prefix matching.

I am not looking for a regular-expression kind of solution, since it does not scale as the number of URLs increases.

Thanks a lot.

Recommended answer

This example works for small URL lists but does not scale well.

def longest_prefix_match(search, urllist):
    # Keep only the stored URLs that start with the search string,
    # then return the longest of them.
    matches = [url for url in urllist if url.startswith(search)]
    if matches:
        return max(matches, key=len)
    else:
        raise Exception("Not found")
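
For instance, called with the URLs from the question (a minimal usage sketch, not part of the original answer; it assumes the URLs are kept in a plain Python list named urls), it returns the longest stored URL that begins with the search string:

urls = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

print(longest_prefix_match('http://www.google.com/doc', urls))  # http://www.google.com/document
print(longest_prefix_match('http://www.face', urls))            # http://www.facebook.com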

An implementation using the trie module.

import trie


def longest_prefix_match(prefix_trie, search):
    # Collect the values of every key in the subtree rooted at `search`,
    # i.e. all stored URLs that start with the search string.
    # There may well be a more elegant way to do this without using
    # "hidden" method _getnode.
    try:
        return list(node.value for node in prefix_trie._getnode(search).walk())
    except KeyError:
        return list()

url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = trie.Trie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print "'%s' ->" % search, longest_prefix_match(url_trie, search)

Results:

'http' -> ['http://www.facebook.com', 'http://www.google.com/document', 'http://www.google.com/mail']
'http://www.go' -> ['http://www.google.com/document', 'http://www.google.com/mail']
'http://www.fa' -> ['http://www.facebook.com']
'http://fail' -> []

Or using PyTrie, which gives the same result but with the lists ordered differently.

from pytrie import StringTrie


url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = StringTrie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print "'%s' ->" % search, url_trie.values(prefix=search)

I'm beginning to think a radix tree / Patricia tree would be better from a memory usage point of view. This is roughly what the radix tree would look like compared with the trie (the original answer illustrated both with diagrams, which are not reproduced here).
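
As a rough stand-in for those diagrams, here is a hand-built nested-dict picture of the two structures for the three example URLs (purely illustrative; the dict layout is my own and is not the API of either package):

# Radix tree: edges carry whole substrings, so the shared prefix
# 'http://www.' is stored exactly once.
radix_tree = {
    'http://www.': {
        'facebook.com': {},
        'google.com/': {'document': {}, 'mail': {}},
    },
}

# A plain character trie spends one nested level (one node) per character,
# e.g. {'h': {'t': {'t': {'p': {':': {'/': {'/': ...}}}}}}}, which is why
# the radix tree is more compact for long shared prefixes like URLs.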
