Best way to test for existing string against a large list of comparables


Problem description

Suppose you have a list of acronyms that define a value (e.g. AB1, DE2, CC3) and you need to check a string value (e.g. "Happy:DE2|234") to see whether an acronym is found in the string. For a short list of acronyms I would usually create a simple regex that uses a separator, e.g. (AB1|DE2|CC3), and just look for a match.

But how would I tackle this if there are over 30 acronyms to match against? Would it make sense to use the same technique (ugly), or is there a more efficient and elegant way to accomplish this task?

Keep in mind that the example acronym list and example string are not the actual data format I am working with; they are just a way to express my challenge.

BTW, I read a related SO question but didn't think it applied to what I was trying to accomplish.

Edit: I forgot to include my need to capture the matched value, hence the choice to use regular expressions.

Recommended answer

Personally, I don't think 30 is particularly large for a regex, so I wouldn't be too quick to rule it out. You can create the regex with a single line of code:

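using System;
using System.Linq;
using System.Text.RegularExpressions;

var acronyms = new[] { "AB", "BC", "CD", "ZZAB" };
// Escape each acronym so that any regex metacharacters are matched literally.
var regex = new Regex(string.Join("|", acronyms.Select(Regex.Escape)), RegexOptions.Compiled);
for (var match = regex.Match("ZZZABCDZZZ"); match.Success; match = match.NextMatch())
    Console.WriteLine(match.Value);
// prints ZZAB and CD (matches never overlap, so AB and BC are not reported)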

So the code is relatively elegant and maintainable. If you know the upper bound for the number of acronyms, I would do some testing; who knows what kind of optimizations are already built into the regex engine. You'll also benefit for free from future regex engine optimizations. Unless you have reason to believe performance will be an issue, keep it simple.

On the other hand, regex may have other limitations. For example, by default, if you have the acronyms AB, BC and CD, it will return only two of them (AB and CD) as matches in "ABCD", because matches never overlap. So it's good at telling you that an acronym is present, but you need to be careful about catching multiple matches.
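If overlapping occurrences matter, one possible workaround (a different technique from the plain alternation above, not something the answer itself proposes) is to wrap the alternation in a lookahead, so that a match consumes no characters and the engine retries at the next position. The sketch below assumes a hypothetical helper named FindOverlapping; note that it reports at most one acronym per starting position, namely the longest.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original answer): reports acronym
// occurrences even when they overlap each other in the text.
static List<string> FindOverlapping(string text, IEnumerable<string> acronyms)
{
    // Put longer alternatives first so the longest acronym wins at each position.
    var alternatives = acronyms
        .OrderByDescending(a => a.Length)
        .Select(Regex.Escape);
    // The lookahead (?=...) matches without consuming characters, so the engine
    // moves on by one position after each hit and overlaps are not skipped.
    var regex = new Regex("(?=(" + string.Join("|", alternatives) + "))");
    return regex.Matches(text).Cast<Match>().Select(m => m.Groups[1].Value).ToList();
}

var found = FindOverlapping("ABCD", new[] { "AB", "BC", "CD" });
Console.WriteLine(string.Join(", ", found)); // AB, BC, CD
```

The captured value is read from group 1 because the lookahead itself matches an empty string.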

When performance became an issue for me (> 10,000 items), I put the 'acronyms' in a HashSet and then searched each substring of the text (from the minimum acronym length to the maximum acronym length). This was OK for me because the source text was very short. I'd not heard of it before, but at first look the Aho-Corasick algorithm, referred to in the question you reference, seems like a better general solution to this problem.
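The HashSet approach described above could look roughly like the following. This is a minimal sketch of the idea rather than the code actually used; the helper name FindAcronyms and the sample data (taken from the question) are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: tests every substring whose length lies between the
// shortest and the longest acronym against the set.
static List<string> FindAcronyms(string text, HashSet<string> acronyms)
{
    var found = new List<string>();
    if (acronyms.Count == 0) return found;
    int minLen = acronyms.Min(a => a.Length);
    int maxLen = acronyms.Max(a => a.Length);
    for (int start = 0; start < text.Length; start++)
    {
        // Stop growing the candidate once it would run past the end of the text.
        for (int len = minLen; len <= maxLen && start + len <= text.Length; len++)
        {
            var candidate = text.Substring(start, len);
            if (acronyms.Contains(candidate)) found.Add(candidate);
        }
    }
    return found;
}

var acronymSet = new HashSet<string> { "AB1", "DE2", "CC3" };
Console.WriteLine(string.Join(", ", FindAcronyms("Happy:DE2|234", acronymSet))); // DE2
```

The number of lookups grows with the text length times the spread between the shortest and longest acronym, which is why this suits short source texts; unlike the plain regex, it also reports overlapping acronyms.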
