高效的字符串匹配算法 [英] Efficient string matching algorithm

查看:124
本文介绍了高效的字符串匹配算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图建立一个高效的字符串匹配算法。这将在一个大容量的环境中执行,所以性能是至关重要的。

I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.

下面是我的要求:

  • 在给定的域名,如www.example.com,确定是否匹配之一条目列表。
  • 在参赛作品可以是绝对匹配,即www.example.com。
  • 条目可包含通配符,如* .example.com的。
  • 通配符条目从最定义的水平和热身赛。例如,* .example.com的将匹配www.example.com,example.com,并sub.www.example.com。
  • 通配符条目没有被嵌入,即子。*。example.com将不会是一个条目。

语言/环境:C#(.NET Framework 3.5中)

Language/environment: C# (.Net Framework 3.5)

我认为分裂的条目(和域名查询)到数组,扭转了订单,然后通过数组迭代。虽然准确,感觉慢。

I've considered splitting the entries (and domain lookup) into arrays, reversing the order, then iterating through the arrays. While accurate, it feels slow.

我认为正则表达式,但我担心准确地重新presenting项定期EX pressions名单。

I've considered Regex, but am concerned about accurately representing the list of entries as regular expressions.

我的问题:什么是如果一个字符串发现,在域名形式的有效方式,匹配任何一个字符串列表,因为上面列出的说明

My question: what's an efficient way of finding if a string, in the form of a domain name, matches any one in a list of strings, given the description listed above?

推荐答案

如果你想推出自己的,我将存储的条目以树状结构。请参见我回答另一个SO质疑有关拼写检查,看看有什么我意思。

If you're looking to roll your own, I would store the entries in a tree structure. See my answer to another SO question about spell checkers to see what I mean.

而不是标记化的结构。人物,我只想把每个条​​目作为一个完整的字符串。任何符号化的实施将仍然需要做的整套字符字符串匹配,无论如何,所以你不妨做这一切在一杆。

Rather than tokenize the structure by "." characters, I would just treat each entry as a full string. Any tokenized implementation would still have to do string matching on the full set of characters anyway, so you may as well do it all in one shot.

这和常规拼写检查树之间的唯一区别是:

The only differences between this and a regular spell-checking tree are:

  1. 的匹配需要在反向进行
  2. 您必须考虑到通配符
  1. The matching needs to be done in reverse
  2. You have to take into account the wildcards

要解决点#2,您只需在测试结束检查的*字符。

To address point #2, you would simply check for the "*" character at the end of a test.

一个简单的例子:

条目:

*.fark.com
www.cnn.com

树:

m -> o -> c -> . -> k -> r -> a -> f -> . -> *
                \
                 -> n -> n -> c -> . -> w -> w -> w

检查www.blog.fark.com将涉及通过树向上回溯到第一*。由于在结束了遍历一个*,有一个匹配。

Checking www.blog.fark.com would involve tracing through the tree up to the first "*". Because the traversal ended on a "*", there is a match.

检查www.cern.com会失败n的第二个N,N,C,...

Checking www.cern.com would fail on the second "n" of n,n,c,...

检查dev.www.cnn.com也将失败,因为遍历一个字符不是*

Checking dev.www.cnn.com would also fail, since the traversal ends on a character other than "*".

这篇关于高效的字符串匹配算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆