Creating multiple indexes for table join to accommodate fuzzy matching


Problem description

I'm trying to match user-provided postal address data to an address reference dataset. I want to index both datasets and join on the indexed field. In a perfect world, the join would use a key consisting of the full address (e.g., WHERE REF_ADDR = INPUT_ADDR would match 100 W Main St, Springfield, OH 45502 = 100 W Main St, Springfield, OH 45502). Of course, addresses are rarely perfect, so I have a script that accommodates differences using fuzzy logic. However, because this script is very slow, I want to reduce the number of candidates from the reference dataset before the matching process is attempted. To find all potential candidates, I intend to create an indexed key, derived from individual address components, to be used for joining. The problem is that one key alone will not capture all the possible candidates, so I would likely need to create multiple indexed keys.

For example, an indexed key in the form of 100 WMNST 455 for the address 100 W Main St, Springfield, OH 45502 will work most of the time, but any number of address errors can cause such a key to miss. To accommodate all the potential errors that the matching process recognizes, I would likely need to implement at least several indexed keys for joining.
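To make the idea of several derived keys concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names `parse_street` and `blocking_keys`, the tiny direction and suffix lists, and the three key types (zip prefix, sorted zip digits, city prefix) are hypothetical and are not the keys of the actual script. Unlike the 100 WMNST 455 example, this sketch drops the direction and street suffix from the key, since user input often omits them.

```python
import re

# Hypothetical, deliberately tiny lookup sets; real data needs fuller lists.
DIRECTIONS = {"N", "S", "E", "W", "NORTH", "SOUTH", "EAST", "WEST"}
SUFFIXES = {"ST", "STREET", "AVE", "AVENUE", "RD", "ROAD", "DR", "DRIVE"}


def parse_street(address):
    """Return (house_number, street_name) with directions and suffixes dropped."""
    tokens = re.findall(r"[A-Z0-9]+", address.upper())
    if not tokens or not tokens[0].isdigit():
        return "", ""
    core = [t for t in tokens[1:] if t not in DIRECTIONS and t not in SUFFIXES]
    return tokens[0], "".join(core)


def blocking_keys(address, city="", zip_code=""):
    """Build several join keys so that one bad field does not hide a candidate."""
    number, name = parse_street(address)
    if not number or not name:
        return set()  # nothing reliable to key on
    stem = f"{number}|{name[:4]}"
    city = re.sub(r"[^A-Z]", "", city.upper())
    zip_code = zip_code.strip()
    keys = set()
    if len(zip_code) == 5 and zip_code.isdigit():
        keys.add(f"Z3|{stem}|{zip_code[:3]}")               # zip prefix
        keys.add(f"ZS|{stem}|{''.join(sorted(zip_code))}")  # tolerates reordered digits
    if len(city) >= 2:
        keys.add(f"C2|{stem}|{city[:2]}")                   # city prefix
    return keys


reference = blocking_keys("100 W Main St", "Springfield", "45502")
variants = [
    ("100 W MAIN", "", "45502"),
    ("100 MAIN ST", "SPNGFLD", ""),
    ("100 W MAIN STREET", "SPRINGFIELD", "54502"),
    ("100 MAIN", "NORTHRIDGE", "45502"),
]
for addr, city, zip_code in variants:
    shared = blocking_keys(addr, city, zip_code) & reference
    print(addr, "->", sorted(shared))
```

Each variant shares at least one key with the reference address, so it would survive as a candidate for the slow fuzzy script. The trade-off is that looser keys also admit more false candidates, so the key set would need tuning against real data.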

I'm wondering if anyone has any recommendations for handling this issue. The reference dataset consists of 40M records, and the user-provided address data is typically around 10,000 records. Would it be more effective to simply use LIKE and OR queries on the address fields instead of the method I'm proposing? It is not unusual to encounter the following variations within the user-provided dataset, all of which the script accommodates:

Address: 100 W MAIN
City: 
Zip: 45502

Address: 100 MAIN ST
City: SPNGFLD
Zip:

Address: 100 W MAIN STREET
City: SPRINGFIELD
Zip: 54502

Address: 100 MAIN
City: NORTHRIDGE
Zip: 45502
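One possible way to join on several keys at once, sketched here under assumptions, is to store one row per (record, key) in a narrow key table and index the key column, so a single equality join replaces a chain of LIKE and OR conditions. The table names `ref_key` and `input_key`, the column `match_key`, and the key strings are hypothetical, and SQLite stands in for whatever database is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ref_key   (ref_id   INTEGER, match_key TEXT);
    CREATE TABLE input_key (input_id INTEGER, match_key TEXT);
    -- The index on the large reference side is what keeps the join cheap.
    CREATE INDEX ix_ref_key ON ref_key (match_key, ref_id);
""")

# One row per (record, key); the key strings are made-up examples.
ref_rows = [
    (1, "Z3|100|MAIN|455"), (1, "C2|100|MAIN|SP"),
    (2, "Z3|100|MAIN|432"), (2, "C2|100|MAIN|CO"),
]
input_rows = [
    (10, "Z3|100|MAIN|455"),                          # city missing
    (11, "C2|100|MAIN|SP"),                           # zip missing
    (12, "Z3|100|MAIN|545"), (12, "C2|100|MAIN|SP"),  # zip digits swapped
    (13, "Z3|200|OAK|455"),                           # no candidate expected
]
conn.executemany("INSERT INTO ref_key VALUES (?, ?)", ref_rows)
conn.executemany("INSERT INTO input_key VALUES (?, ?)", input_rows)

# A pair is a candidate if the two records share at least one key.
candidates = conn.execute("""
    SELECT DISTINCT i.input_id, r.ref_id
    FROM input_key AS i
    JOIN ref_key AS r ON r.match_key = i.match_key
    ORDER BY i.input_id, r.ref_id
""").fetchall()

for input_id, ref_id in candidates:
    print(input_id, ref_id)
```

Only the candidate pairs returned by this join would then be passed to the fuzzy matching script. Adding a new key type means adding rows, not columns or indexes, which may matter with 40M reference records.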


Recommended answer

Depending on which DB system you are using, first check whether any built-in functionality can be used. For example, if you are working with SQL Server, options I can think of are "Change Data Capture", "Full-Text Search", "Filtered Indexes", etc. But if you want to develop your own solution that can be implemented on any DB system, then this might interest you.

You have asked for indexing options, but to me that is not the right question: you will be left with very few options as the data in the table grows and/or your search criteria become complex. If the schema design itself is not scalable, you will not be able to add further performance improvements later in extreme data cases.

I created a design to implement a so-called "Google-like search" in our project, where matching text suggestions appear as the user starts typing. The user can also control, through a setting, which type of search is performed.

By type of search I mean exact match, similar match, starts with A, ends with A, or contains A.

In your case, an address is the kind of data where an exact match rarely happens. So I guess you can skip that, but if you want to implement it, it can be done with some changes. You can customize the design depending on the sophistication and complexity you want to handle. Here's the concept.

We need 5 tables.

Now the question is: how does this schema help or improve your fuzzy search?

Notice that each table has ONLY 2 columns, of INTEGER and/or STRING type, so we can have a clustered index on each table that includes both columns.

Because we have separated the data by accuracy, you can let the user choose how accurate the data they want to access should be. This reduces the search load and also lets you batch your search operations.

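If you would like to go this way, let me know. Creating dummy data and coming up with performance numbers is not a big deal, and I can provide a final design that may suit you.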
