Creating multiple indexes for table join to accommodate fuzzy matching

Views: 159 | Published: 2018/8/2 15:41:20 | Tags: sql, indexing

Problem description

I'm trying to match user-provided postal address data to an address reference dataset. I want to index both datasets and join on the indexed field. In a perfect world, the join would use a key consisting of the full address (e.g., WHERE REF_ADDR = INPUT_ADDR would match 100 W Main St, Springfield, OH 45502 = 100 W Main St, Springfield, OH 45502). Of course, addresses are rarely perfect, so I have a script that accommodates differences using fuzzy logic. However, because this script is very slow, I want to reduce the number of candidates from the reference dataset before running the matching process against them. To find all potential candidates, I intend to create an indexed key, derived from individual address components, to join on. The problem is that one key alone will not capture all possible candidates, so I would likely need to create multiple indexed keys.

For example, an indexed key in the form of 100 WMNST 455 for address 100 W Main St, Springfield, OH 45502 will be good most of the time, but there can be any number of address errors that such a key will not catch. To accommodate all potential errors that the matching process will recognize, I would likely need to implement several indexed keys for joining.
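
A minimal sketch of how a key like 100 WMNST 455 might be derived from already-parsed address components. The rule used here (house number, then pre-directional + vowel-stripped street name + suffix, then the first three ZIP digits) is only inferred from that single example and is an assumption, not a confirmed rule; `strip_vowels` and `make_key` are hypothetical names.

```python
VOWELS = set("AEIOU")


def strip_vowels(word: str) -> str:
    """Keep the first letter and drop any later vowels (MAIN -> MN)."""
    word = word.upper()
    return word[:1] + "".join(c for c in word[1:] if c not in VOWELS)


def make_key(house_number: str, predir: str, street_name: str,
             suffix: str, zip_code: str) -> str:
    """Build a compact join key from parsed address components.

    Assumed layout: '<house number> <predir><name skeleton><suffix> <zip3>'.
    """
    street_part = predir.upper() + strip_vowels(street_name) + suffix.upper()
    return f"{house_number} {street_part} {zip_code[:3]}"


print(make_key("100", "W", "Main", "St", "45502"))  # 100 WMNST 455
```

Because the directional, the suffix, and the ZIP prefix are all baked into one string, an input that lacks any one of them produces a different key, which is why a single key of this shape misses candidates.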

Does anyone have recommendations for handling this issue? The reference dataset consists of 40M records, and the user-provided address data is typically around 10,000 records. Would it be more effective to simply use LIKE and OR queries on the address fields instead of the method I'm proposing? It is not unusual to encounter the following variations within the user-provided dataset (all of which the script accommodates):

Address: 100 W MAIN
City: 
Zip: 45502

Address: 100 MAIN ST
City: SPNGFLD
Zip:

Address: 100 W MAIN STREET
City: SPRINGFIELD
Zip: 54502

Address: 100 MAIN
City: NORTHRIDGE
Zip: 45502
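
As an illustration of the multiple-key idea, the sketch below stores two looser keys per address, indexes each on the reference table, and takes the UNION of two equality joins to collect candidates for the slow matcher. Everything in it is an assumption made for the example: the table and column names (`ref`, `inp`, `key_zip`, `key_city`), the tiny directional and suffix lists, and the key recipes (house number + street-name skeleton + first three ZIP digits, and the same with the first two letters of the city). SQLite from the Python standard library stands in for whatever database is actually used.

```python
import sqlite3

DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
SUFFIXES = {"ST", "STREET", "AVE", "AVENUE", "RD", "ROAD", "DR", "DRIVE"}


def street_core(address):
    """Return (house_number, street-name skeleton), e.g. ('100', 'MN')."""
    tokens = address.upper().split()
    house, rest = tokens[0], tokens[1:]
    # Directionals and suffixes are often missing, so leave them out of the key.
    name = "".join(t for t in rest if t not in DIRECTIONALS | SUFFIXES)
    skeleton = name[:1] + "".join(c for c in name[1:] if c not in "AEIOU")
    return house, skeleton


def blocking_keys(address, city, zip_code):
    """Build two keys; a key is None (SQL NULL) when its component is missing."""
    house, core = street_core(address)
    key_zip = f"{house}|{core}|{zip_code[:3]}" if zip_code else None
    key_city = f"{house}|{core}|{city.upper()[:2]}" if city else None
    return key_zip, key_city


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ref (id INTEGER PRIMARY KEY, addr TEXT, key_zip TEXT, key_city TEXT);
    CREATE TABLE inp (id INTEGER PRIMARY KEY, addr TEXT, key_zip TEXT, key_city TEXT);
    CREATE INDEX ref_key_zip ON ref (key_zip);
    CREATE INDEX ref_key_city ON ref (key_city);
""")

reference = [("100 W Main St", "Springfield", "45502"),
             ("200 E Oak Ave", "Dayton", "45402")]
user_rows = [("100 W MAIN", "", "45502"),
             ("100 MAIN ST", "SPNGFLD", ""),
             ("100 W MAIN STREET", "SPRINGFIELD", "54502"),
             ("100 MAIN", "NORTHRIDGE", "45502")]

for table, rows in (("ref", reference), ("inp", user_rows)):
    conn.executemany(
        f"INSERT INTO {table} (addr, key_zip, key_city) VALUES (?, ?, ?)",
        [(a, *blocking_keys(a, c, z)) for a, c, z in rows],
    )

# Each branch is a plain equality join, so each can use its own index.
# NULL keys never compare equal, so missing components produce no matches.
candidates = sorted(conn.execute("""
    SELECT inp.id, ref.id FROM inp JOIN ref ON ref.key_zip = inp.key_zip
    UNION
    SELECT inp.id, ref.id FROM inp JOIN ref ON ref.key_city = inp.key_city
""").fetchall())

print(candidates)  # [(1, 1), (2, 1), (3, 1), (4, 1)]
```

In this toy data all four variations reach reference row 1 through at least one key, and none reach the unrelated Oak Ave row. Looser keys return more candidates per input row, so the recipes would need tuning against real data; whether this beats LIKE and OR predicates on 40M rows depends on the database engine and is not something the sketch measures.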
