针对大量可比对象测试现有字符串的最佳方法 [英] Best way to test for existing string against a large list of comparables

查看:14
本文介绍了针对大量可比对象测试现有字符串的最佳方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Suppose you have a list of acronym's that define a value (ex. AB1,DE2,CC3) and you need to check a string value (ex. "Happy:DE2|234") to see if an acronym is found in the string. For a short list of acronym's I would usually create a simple RegEx that used a separator (ex. (AB1|DE2|CC3) ) and just look for a match.

But how would I tackle this if there are over 30 acronym's to match against? Would it make sense to use the same technique (ugly) or is there a more effecient and elegant way to accomplish this task?

Keep in mind the example acronym list and example string is not the actual data format that I am working with, rather just a way to express my challenge.

BTW, I read a SO related question but didn't think it applied to what I was trying to accomplish.

EDIT: I forgot to include my need to capture the matched value, hence the choice to use Regular Expressions...

解决方案

Personally I don't think 30 is particularly large for a regex so I wouldn't be too quick to rule it out. You can create the regex with a single line of code:

var acronyms = new[] { "AB", "BC", "CD", "ZZAB" };
var regex = new Regex(string.Join("|", acronyms), RegexOptions.Compiled);
for (var match = regex.Match("ZZZABCDZZZ"); match.Success; match = match.NextMatch())
    Console.WriteLine(match.Value);
// returns AB and CD

So the code is relatively elegant and maintainable. If you know the upper bound for the number of acronyms I would to some testing, who knows what kind of optimizations there are already built into the regex engine. You'll also be able to benefit for free from future regex engine optimizations. Unless you have reason to believe performance will be an issue keep it simple.

On the other hand regex may have other limitations e.g. by default if you have acronyms AB, BC and CD then it'll only return two of these as a match in "ABCD". So its good at telling you there is an acronym but you need to be careful about catching multiple matches.

When performance became an issue for me (> 10,000 items) I put the 'acronyms' in a HashSet and then searched each substring of the text (from min acronym length to max acronym length). This was ok for me because the source text was very short. I'd not heard of it before, but at first look the Aho-Corasick algorithm, referred to in the question you reference, seems like a better general solution to this problem.

这篇关于针对大量可比对象测试现有字符串的最佳方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆