Calculating context-sensitive text correlation


Question


Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.


Example: "West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".


How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?


I think this is a tricky one - perhaps there are some well-known algorithms out there?

Answer

A good baseline, probably an impractical one in terms of its relatively high computational cost and, more importantly, its production of many false positives, would be generic string distance algorithms such as

  • Edit distance (aka Levenshtein distance)
  • Ratcliff/Obershelp (http://www.itl.nist.gov/div897/sqg/dads/HTML/ratcliffObershelp.html)
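To make the baseline (and its false-positive problem) concrete, here is a minimal dynamic-programming Levenshtein distance run against the addresses from the question; this is a generic sketch, not tied to any particular library:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Raw edit distance illustrates the false-positive problem: the pair that
# refers to DIFFERENT addresses scores closer (smaller distance) than the
# pair that refers to the SAME address written two ways.
print(levenshtein("West Lawnmower Drive 54 A", "East Lawnmower Drive 54 A"))  # -> 2
print(levenshtein("West Lawnmower Drive 54 A", "W. Lawn Mower Dr. 54A"))
```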

Depending on the level of accuracy required (which, BTW, should be specified both in terms of its recall and precision, i.e. generally expressing whether it is more important to miss a correlation than to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:

  • tokenize the input, i.e. see the input as an array of words rather than a string
  • tokenization should also keep the line number info
  • normalize the input with the use of a short dictionary of common substitutions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the beginning of a line is "West" etc.)
  • identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP Code, Extended ZIP Code, and also City)
  • identify (lookup) some of these entities (for example a relatively short database table can include all the cities/towns in the targeted area)
  • identify (lookup) some domain-related entities (if all/many of the addresses deal with, say, folks in the legal profession, a lookup of law firm names or of federal buildings may be of help)
  • generally, put more weight on tokens that come from the last line of the address
  • put more (or less) weight on tokens with a particular entity type (ex: "Drive", "Street", "Court" should weigh much less than the tokens which precede them)
  • consider a modified SOUNDEX algorithm to help with normalization
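The first three bullets can be sketched as follows. The `SUBSTITUTIONS` table is a tiny illustrative assumption (a real dictionary would be much larger, and position-sensitive, e.g. only expanding "dr" at the end of a line):

```python
import re

# Hypothetical substitution dictionary -- illustrative only.
SUBSTITUTIONS = {"dr": "drive", "st": "street", "w": "west", "e": "east",
                 "jack": "john", "bill": "william"}

def tokenize(address: str) -> list:
    """Split a multi-line address into (line_number, token) pairs,
    keeping the line number info as the answer suggests."""
    tokens = []
    for line_no, line in enumerate(address.splitlines()):
        for word in re.findall(r"[A-Za-z0-9]+", line):
            tokens.append((line_no, word.lower()))
    return tokens

def normalize(tokens: list) -> list:
    """Apply the substitution dictionary to each token (here
    unconditionally; a fuller version would check the token position)."""
    return [(ln, SUBSTITUTIONS.get(tok, tok)) for ln, tok in tokens]

print(normalize(tokenize("W. Lawn Mower Dr. 54A")))
# -> [(0, 'west'), (0, 'lawn'), (0, 'mower'), (0, 'drive'), (0, '54a')]
```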

With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework is that each heuristic is in its own function and rules can be prioritized, i.e. placing some strong heuristics early in the chain allows aborting the evaluation early (e.g.: different City => Correlation = 0, level of confidence = 95% etc.).
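A prioritized rule chain with early termination could look roughly like this; the rule names, return convention (correlation, confidence) and the weak token-overlap fallback are illustrative assumptions, not part of the original answer:

```python
def different_city(a: dict, b: dict):
    """Strong heuristic: two distinct known cities cannot match."""
    if a.get("city") and b.get("city") and a["city"] != b["city"]:
        return 0.0, 0.95   # correlation, confidence -> conclusive
    return None            # inconclusive; fall through to the next rule

def token_overlap(a: dict, b: dict):
    """Weak fallback: Jaccard overlap of the normalized tokens."""
    ta, tb = set(a["tokens"]), set(b["tokens"])
    return len(ta & tb) / len(ta | tb), 0.5

RULES = [different_city, token_overlap]   # strongest discriminators first

def correlate(a: dict, b: dict):
    for rule in RULES:
        verdict = rule(a, b)
        if verdict is not None:
            return verdict   # early termination on a conclusive rule

print(correlate({"city": "Springfield", "tokens": ["west"]},
                {"city": "Shelbyville", "tokens": ["west"]}))
# -> (0.0, 0.95)
```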

An important consideration with searching for correlations is the need to a priori compare every single item (here, address) with every other item, hence requiring as many as n²/2 item-level comparisons. Because of this, it may be useful to store the reference items in a way where they are pre-processed (parsed, normalized...) and also to maybe have a digest/key of sorts that can be used as a [very rough] indicator of a possible correlation (for example a key made of the 5-digit ZIP Code followed by the SOUNDEX value of the "primary" name).
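Such a digest key could be sketched as below, pairing the ZIP code with a standard American Soundex of the primary name (hand-rolled here, since the Python standard library ships no Soundex; the key layout is an assumption following the example in the text):

```python
def soundex(word: str) -> str:
    """Standard American Soundex: first letter + three digit codes.
    H and W are ignored but do not break a run of equal codes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    out, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":       # H/W do not reset the previous code
            prev = code
    return (out + "000")[:4]     # pad/truncate to 4 characters

def blocking_key(zip5: str, primary_name: str) -> str:
    """Rough digest: ZIP code + Soundex of the primary street/name token.
    Only records sharing a key need full pairwise comparison."""
    return zip5 + soundex(primary_name)

print(blocking_key("12345", "Lawnmower"))   # -> '12345L560'
```

Grouping records by this key before comparing cuts the n²/2 comparisons down to pairs within each bucket, at the cost of missing matches whose keys disagree.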
