Lucene Fuzzy Search for customer names and partial address

Views: 88 Published: 2020/5/4 7:37:07 Tags: lucene fuzzy-search

Problem description

I went through all the existing question posts but couldn't find anything really relevant.

I have a file with millions of records containing a person's first name, last name, address1, address2, country code, and date of birth. I would like to check my list of customers against this file on a daily basis (my customer list is updated daily, and so is the file).

For first name and last name I would like a fuzzy match (perhaps Lucene FuzzyQuery / Levenshtein distance with a 90% match), and for the remaining fields, country and date of birth, I want an exact match.

I am new to Lucene, but judging by the number of posts, it looks like this is possible.

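My questions are:

How should I index the input file? I need to build an index on the combination of FN, LN, country, and DOB, and search using that index.
How can I use Lucene's fuzzy query here?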

Is there any other way I can implement the same thing?

Recommended answer

Rushik, here are a few ideas:

Consider using Solr. It is much easier to get started with than bare Lucene.
Build a Lucene/Solr index of the file. It appears that a document per customer is enough, if you use a multi-valued field or two different fields for addresses.
Do you have a unique id per person? To use Solr, you need one. In Lucene, you can get away without using a unique id.
Store the country code as a "keyword". If you only require exact match for date of birth, you may do the same. For range queries, you will need another representation.
I assume your customer list is smaller than the file. A possible policy would be to index the changes in the file daily (here a unique id is really handy; otherwise you need to delete by query, which may miss the mark). Then you can optimize the index, and after that run a search for your updated customer list.
What you describe is a BooleanQuery whose clauses are fuzzy queries for the first and last names and term queries for the other fields. You can create the query programmatically or with the query parser.
Consider using soundex for names as described here.
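
To make the matching rule above concrete, here is a minimal sketch in plain Python (not Lucene) of what such a query expresses: exact equality on country and date of birth, plus a bounded Levenshtein distance on the two name fields, with every clause required. The `Person` type, the field names, and the `max_edits` default are assumptions made up for this illustration. Note that, as far as I know, recent Lucene versions express fuzziness in FuzzyQuery as a maximum number of edits (at most 2) rather than as a percentage, so a "90% match" would have to be mapped onto an edit budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical record layout; the real file may differ.
    first_name: str
    last_name: str
    country: str  # exact-match field, e.g. "US"
    dob: str      # exact-match field, e.g. "1980-01-01"


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_equal(a: str, b: str, max_edits: int = 1) -> bool:
    """Case-insensitive comparison that tolerates up to max_edits edits."""
    return edit_distance(a.casefold(), b.casefold()) <= max_edits


def matches(customer: Person, record: Person, max_edits: int = 1) -> bool:
    """All four clauses are required, like MUST clauses in a boolean query."""
    return (customer.country == record.country
            and customer.dob == record.dob
            and fuzzy_equal(customer.first_name, record.first_name, max_edits)
            and fuzzy_equal(customer.last_name, record.last_name, max_edits))
```

Comparing every customer against every record this way would be far too slow for millions of records; the point of the index is that the exact-match terms narrow the candidates before the fuzzy clauses are evaluated. The sketch only shows the matching semantics.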
