Lucene Fuzzy Search for customer names and partial address


Question

I went through the existing questions but couldn't find anything sufficiently relevant.

I have a file with millions of records containing each person's first name, last name, address1, address2, country code, and date of birth. I would like to check my list of customers against this file on a daily basis (both my customer list and the file are updated daily).

For first name and last name I would like a fuzzy match (maybe Lucene's FuzzyQuery, or Levenshtein distance with a 90% match), and for the remaining fields, country and date of birth, I want an exact match.

I am new to Lucene, but judging by the number of posts, it looks like this is possible.

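My questions are:

  • How should I index the input file? I need to build the index on a combination of FN, LN, country, and DOB, and search using that index.
  • How can I use Lucene's fuzzy query here?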

Is there any other way I can implement the same thing?

Recommended answer

Rushik, here are a few ideas:

  • Consider using Solr. It is much easier to get started with than bare Lucene.
  • Build a Lucene/Solr index of the file. It appears that a document per customer is enough, if you use a multi-valued field or two different fields for addresses.
  • Do you have a unique id per person? To use Solr, you need one. In Lucene, you can get away without using a unique id.
  • Store the country code as a "keyword". If you only require exact match for date of birth, you may do the same. For range queries, you will need another representation.
  • I assume your customer list is smaller than the file. A possible policy would be to index the changes in the file daily (here a unique id is really handy; otherwise you need to delete by query, which may miss the mark). Then you can optimize the index, and after that run a search for your updated customer list.
  • What you describe is a BooleanQuery whose clauses are fuzzy queries for the first and last names and term queries for the other fields. You can create the query programmatically or using the query parser.
  • Consider using soundex for names as described here.
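
To make the "fuzzy on names, exact on everything else" rule concrete, here is a minimal plain-Java sketch that does not depend on Lucene. It is not how Lucene evaluates a BooleanQuery internally; it only illustrates the matching rule described above. The class name `NameMatchSketch`, the `Person` holder, the `similarity` normalisation (one minus edit distance divided by the longer name's length), and the ISO-style date strings are all assumptions made for illustration.

```java
import java.util.Locale;

/** Plain-Java sketch: fuzzy match on names, exact match on country and date of birth. */
class NameMatchSketch {

    /** Hypothetical record holder; the field names are illustrative only. */
    static final class Person {
        final String firstName, lastName, country, dateOfBirth;

        Person(String firstName, String lastName, String country, String dateOfBirth) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.country = country;
            this.dateOfBirth = dateOfBirth;
        }
    }

    /** Two-row dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the last full row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1], assumed here to be 1 - distance / length of the longer name. */
    static double similarity(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(x, y) / longest;
    }

    /** Every condition must hold, mirroring a BooleanQuery made only of required clauses. */
    static boolean matches(Person customer, Person fileRecord, double threshold) {
        return customer.country.equals(fileRecord.country)
                && customer.dateOfBirth.equals(fileRecord.dateOfBirth)
                && similarity(customer.firstName, fileRecord.firstName) >= threshold
                && similarity(customer.lastName, fileRecord.lastName) >= threshold;
    }

    public static void main(String[] args) {
        Person customer = new Person("Jonathan", "Smith", "US", "1980-01-31");
        Person fileRecord = new Person("Jonathon", "Smith", "US", "1980-01-31");
        System.out.println(similarity(customer.firstName, fileRecord.firstName));
        System.out.println(matches(customer, fileRecord, 0.85));
    }
}
```

One thing this sketch makes visible: with this normalisation, a single typo in an eight-letter name gives a similarity of 87.5%, so a strict 90% threshold would reject it. Recent Lucene versions express FuzzyQuery tolerance as a maximum edit distance (at most 2) rather than a percentage, so it may be worth deciding on an allowed number of edits instead of a fixed percentage.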
