How to remove similar strings from a list?


Problem description

What would be an efficient way to remove similar strings from a list?

Consider a List<string> consisting of these (and other) strings:

"SRS INVESTMENT MANAGEMENT, LLC"

"SRS INVESTMENT MANAGEMENT"

"Maplelane Capital, Ltd."

"Maplelane Capital, Limited"

So what I need to do is remove strings that are 'similar enough'. My idea is to capitalise all the strings in the list and then remove any string that matches another string in everything except its last X characters. In the end I want to be left with a list containing only one string for each real-life company that the strings represent.

Any ideas on how I can achieve this?

Solution

I would suggest you create an IEqualityComparer<string> to encapsulate the logic that decides whether two strings are equal.

For example, if you wanted to mix and match SoundEx and Levenshtein, it might look something like this:

public class CompanyNameComparer : IEqualityComparer<string>
{

    public bool Equals(string x, string y)
    {
        if (x == null && y == null)
        {
            return true;
        }
        if (x == null || y == null)
        {
            return false;
        }

        var src1 = FormatString(x);
        var src2 = FormatString(y);

        if (src1 == src2)
        {
            return true;
        }

        var difference = CalcLevenshteinDistance(src1, src2);

        // Arbitrary threshold; you will need to find a value that works for your data.
        return difference < 7;
    }

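    // Normalises a name before comparison: trims whitespace and upper-cases
    // it in a culture-independent way.
    private string FormatString(string source)
    {
        return source.Trim().ToUpperInvariant();
    }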

    // code taken from http://stackoverflow.com/a/9453762/1798889
    private int CalcLevenshteinDistance(string a, string b)
    {
        // code not included; placeholder so the class compiles
        throw new System.NotImplementedException();
    }

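    // Strings that Equals treats as equal must return the same hash code,
    // so hash the same normalised form that Equals compares.
    public int GetHashCode(string obj)
    {
        if (obj == null)
        {
            return 0;
        }
        return Soundex(FormatString(obj)).GetHashCode();
    }

    private string Soundex(string data)
    {
        // code not included; placeholder so the class compiles
        throw new System.NotImplementedException();
    }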
}
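
The comparer above leaves CalcLevenshteinDistance out. As a rough sketch only, and not the code from the linked answer, the standard two-row dynamic-programming edit distance could look something like this. The names LevenshteinSketch and Distance are made up for illustration.

```csharp
using System;

public static class LevenshteinSketch
{
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b. Only two rows of the DP table are kept.
    public static int Distance(string a, string b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));
        if (a.Length == 0) return b.Length;
        if (b.Length == 0) return a.Length;

        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int insertOrDelete = Math.Min(current[j - 1] + 1, previous[j] + 1);
                current[j] = Math.Min(insertOrDelete, previous[j - 1] + cost);
            }

            // The row just computed becomes the previous row for the next i.
            var swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }
}
```

With the sample names from the question, "SRS INVESTMENT MANAGEMENT, LLC" and "SRS INVESTMENT MANAGEMENT" differ by 5 edits (the trailing ", LLC"), which is under the threshold of 7 used in Equals.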

I didn't include all the code because that's not the main point. Only you will know whether SoundEx and Levenshtein will work or whether it needs to be something else. But if you put that decision-making in its own class, then when it needs to be tweaked there is just one place to change.

Then you can get a unique list with either LINQ or a HashSet, assuming data is the name of your List<string> variable:

var uniqueEnumerable = data.Distinct(new CompanyNameComparer());
var uniqueSet = new HashSet<string>(data, new CompanyNameComparer());
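
One caveat: both Distinct and HashSet look at GetHashCode first, so Equals only runs for strings whose hash codes match, and a "distance below a threshold" rule is not transitive, so the result can depend on the order of the list. If that turns out to matter, a plain greedy pass that ignores hash codes is one alternative. This is only a sketch under those assumptions; FuzzyDedupSketch and KeepFirstOfEachGroup are hypothetical names, and isSimilar can be any predicate you choose.

```csharp
using System;
using System.Collections.Generic;

public static class FuzzyDedupSketch
{
    // Keeps the first string of each group of similar strings. A later string
    // is dropped when the predicate says it is similar to one already kept.
    // This compares every candidate with every kept string: O(n^2) in the worst case.
    public static List<string> KeepFirstOfEachGroup(
        IEnumerable<string> source,
        Func<string, string, bool> isSimilar)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (isSimilar == null) throw new ArgumentNullException(nameof(isSimilar));

        var kept = new List<string>();
        foreach (var candidate in source)
        {
            bool duplicate = false;
            foreach (var existing in kept)
            {
                if (isSimilar(existing, candidate))
                {
                    duplicate = true;
                    break;
                }
            }

            if (!duplicate)
            {
                kept.Add(candidate);
            }
        }

        return kept;
    }
}
```

It could be called with the comparer from above, for example FuzzyDedupSketch.KeepFirstOfEachGroup(data, (a, b) => comparer.Equals(a, b)), where comparer is a CompanyNameComparer instance. The trade-off is speed: this does not scale as well as a hash-based set on large lists.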
