Data matching algorithm

Question

I am currently working on a project where a data matching algorithm needs to be implemented. An external system passes in all the data it knows about a customer, and the system I am designing has to return the matching customer. The external system then knows the customer's correct ID, and it can fetch additional data or update its own data for that customer.

The following fields are passed in:


  • Name
  • Name2
  • Street
  • City
  • ZipCode
  • BankAccountNumber
  • BankName
  • BankCode
  • Email
  • Phone
  • Fax
  • Web

The data can be of high quality, with a lot of information available, but often it is poor: only the name and address are available, and they may contain misspellings.

I'm implementing the project in .NET. What I currently do is something like the following:

public bool IsMatch(Customer customer)
{
    // CanIdentify just checks if the info is provided and has a specific length (e.g. > 1)
    if (CanIdentifyByStreet() && CanIdentifyByBankAccountNumber())
    {
        // some parsing of strings done before (substring, etc.)
        if(Street == customer.Street && AccountNumber == customer.BankAccountNumber) return true;
    }
    if (CanIdentifyByStreet() && CanIdentifyByZipCode() && CanIdentifyByName())
    {
        ...
    }
    // No rule matched.
    return false;
}

I am not very happy with the approach above, because I would have to write an if statement for every reasonable combination of fields so that I don't miss any chance of matching the entity.

So I thought maybe I could compute some kind of matching score instead: for each criterion that matches, points are added to the score. Like this:

public bool IsMatch(Customer customer)
{
    int matchingScore = 0;
    if (CanIdentifyByStreet())
    {
        if(....)
            matchingScore += 10;
    }
    if (CanIdentifyByName())
    {
        if(....)
            matchingScore += 10;
    }
    if (CanIdentifyByBankAccountNumber())
    {
        if(....)
            matchingScore += 10;
    }

    // iDontKnow stands for the threshold that still has to be decided.
    return matchingScore > iDontKnow;
}

This would let me take all matching data into consideration and increase the score by a weight that depends on the field. If the score is high enough, it's a match.
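As a rough illustration of the weighted idea (not code from the original post), the sketch below divides the weight of the fields that agree by the weight of the fields that could be compared at all, so a sparse record is not penalised for missing data. The class name `WeightedMatcher`, the dictionary-based record shape, and every weight value are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a weighted, normalised matching score.
public static class WeightedMatcher
{
    // Illustrative weights only; real values would need tuning against known matches.
    public static readonly Dictionary<string, double> Weights = new Dictionary<string, double>
    {
        { "BankAccountNumber", 40 },
        { "Email", 30 },
        { "Name", 15 },
        { "Street", 10 },
        { "ZipCode", 5 },
    };

    // Returns a score in [0, 1]; 0 when no weighted field is present on both sides.
    public static double Score(IDictionary<string, string> incoming, IDictionary<string, string> stored)
    {
        double matched = 0, comparable = 0;
        foreach (var pair in Weights)
        {
            // Skip fields that are missing or blank on either side.
            if (!incoming.TryGetValue(pair.Key, out var a) || string.IsNullOrWhiteSpace(a)) continue;
            if (!stored.TryGetValue(pair.Key, out var b) || string.IsNullOrWhiteSpace(b)) continue;

            comparable += pair.Value;
            if (string.Equals(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase))
                matched += pair.Value;
        }
        return comparable == 0 ? 0 : matched / comparable;
    }

    // The threshold is left to the caller because it has to be found empirically.
    public static bool IsMatch(IDictionary<string, string> incoming, IDictionary<string, string> stored, double threshold)
        => Score(incoming, stored) >= threshold;
}
```

One caveat with this normalisation: if only a low-weight field such as `ZipCode` is comparable and it agrees, the score is still 1.0, so in practice a minimum amount of comparable weight would probably be required as well.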

Now my question is: are there any best practices for this kind of thing, such as matching algorithm patterns? Thanks a lot!

Recommended answer

For inspiration, look at the Levenshtein distance algorithm. It gives you a reasonable mechanism for weighting your comparisons.
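For reference, here is a minimal sketch of the Levenshtein distance in C#, together with a similarity value between 0 and 1 derived from it. The class name `FuzzyText` and the `Similarity` normalisation (distance divided by the longer length, after trimming and lower-casing) are assumptions for illustration, not something the answer prescribes.

```csharp
using System;

public static class FuzzyText
{
    // Dynamic-programming Levenshtein distance: the minimum number of
    // single-character insertions, deletions and substitutions turning a into b.
    // Only two rows of the table are kept in memory.
    public static int LevenshteinDistance(string a, string b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1), // insertion, deletion
                    previous[j - 1] + cost);                       // substitution or match
            }
            var swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }

    // Assumed normalisation: 1.0 for strings that are equal after trimming and
    // lower-casing, falling towards 0.0 as more edits are needed.
    public static double Similarity(string a, string b)
    {
        a = a.Trim().ToLowerInvariant();
        b = b.Trim().ToLowerInvariant();
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1.0;
        return 1.0 - (double)LevenshteinDistance(a, b) / longest;
    }
}
```

A similarity like this could replace the exact equality test for a field such as the name or street, with the field's points scaled by the similarity instead of being all or nothing.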

I would also add that, in my experience, you can never match two arbitrary pieces of data to the same entity with absolute certainty. You need to present plausible matches to a user, who can then verify whether John Smith on 1920 E. Pine is the same person as Jon Smith on 192 East Pine Road.
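One way to act on this advice is to return a short, ranked list of candidates instead of a single yes/no answer. The sketch below is only an illustration under assumptions: `Candidate<T>`, `CandidateRanker`, and the `minScore` and `maxResults` parameters are hypothetical names, and the scoring function is passed in by the caller, so any scoring scheme can be plugged in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container pairing a stored record with its score.
public sealed class Candidate<T>
{
    public Candidate(T record, double score) { Record = record; Score = score; }
    public T Record { get; }
    public double Score { get; }
}

public static class CandidateRanker
{
    // Scores every stored record against the incoming one, drops records
    // scoring below minScore, and returns at most maxResults candidates,
    // best first, for a person to confirm or reject.
    public static IReadOnlyList<Candidate<T>> Rank<T>(
        T incoming,
        IEnumerable<T> stored,
        Func<T, T, double> score,
        double minScore,
        int maxResults)
    {
        return stored
            .Select(record => new Candidate<T>(record, score(incoming, record)))
            .Where(c => c.Score >= minScore)
            .OrderByDescending(c => c.Score)
            .Take(maxResults)
            .ToList();
    }
}
```

The external system would then show this list to an operator, and only the record the operator confirms would be treated as the customer's ID.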
