Hibernate Search: How to use wildcards correctly?

Problem description

I have the following query to search patients by full name, for a specific medical center:

MustJunction mj = qb.bool().must(qb.keyword()
    .onField("medicalCenter.id")
    .matching(medicalCenter.getId())
    .createQuery());
for (String term : terms) {
    if (!term.isEmpty()) {
        mj.must(qb.keyword()
            .onField("fullName")
            .matching(term + "*")
            .createQuery());
    }
}

It works perfectly, but only if the user types the full first name and/or last name of the patient.

However, I would like to make it work even if the user types only part of the first name or last name.

For example, if there's a patient called "Bilbo Baggins", I would like the search to find him when the user types "Bilbo Baggins", "Bilbo", "Baggins", or even just "Bil" or "Bag".

To achieve this I modified the above query as follows:

MustJunction mj = qb.bool().must(qb.keyword()
    .onField("medicalCenter.id")
    .matching(medicalCenter.getId())
    .createQuery());
for (String term : terms) {
    if (!term.isEmpty()) {
        mj.must(qb.keyword()
            .wildcard()
            .onField("fullName")
            .matching(term + "*")
            .createQuery());
    }
}

Note how I added the wildcard() call before the call to onField().

However, this breaks the search and returns no results. What am I doing wrong?

Solution

Short answer: don't use wildcard queries; use a custom analyzer with an EdgeNGramFilterFactory. Also, don't try to analyze the query yourself (that's what you did by splitting the query into terms): Lucene will do it much better (with a WhitespaceTokenizerFactory, an ASCIIFoldingFilterFactory and a LowerCaseFilterFactory in particular).

Long answer:

Wildcard queries are useful as quick and easy solutions to one-time problems, but they are not very flexible and reach their limits quite quickly. In particular, as @femtoRgon mentioned, these queries are not analyzed, so an uppercase query won't match a lowercase name, for instance.
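To make the "not analyzed" point concrete, here is a minimal plain-Java sketch. It does not use Lucene at all: INDEXED_TERMS and prefixWildcardMatches are hypothetical stand-ins that only imitate an index whose terms were lowercased at index time and a trailing-"*" wildcard that is compared against those terms exactly as typed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class WildcardCaseDemo {
    // Terms as a lowercasing analyzer would have stored them at index time
    // (assumed for illustration).
    static final List<String> INDEXED_TERMS = Arrays.asList("bilbo", "baggins");

    // Simplified stand-in for a trailing-"*" wildcard query: the pattern is
    // compared against the indexed terms as typed, with no analysis applied.
    static boolean prefixWildcardMatches(String pattern) {
        if (!pattern.endsWith("*")) {
            return INDEXED_TERMS.contains(pattern);
        }
        String prefix = pattern.substring(0, pattern.length() - 1);
        return INDEXED_TERMS.stream().anyMatch(term -> term.startsWith(prefix));
    }

    public static void main(String[] args) {
        // The user typed a capital letter; the index only holds lowercase terms.
        System.out.println(prefixWildcardMatches("Bil*")); // false
        // Lowercasing by hand is the kind of manual analysis the answer advises against.
        System.out.println(prefixWildcardMatches("Bil*".toLowerCase(Locale.ROOT))); // true
    }
}
```

Lowercasing by hand fixes this one case, but accents and other normalization would each need the same manual treatment, which is why the answer moves the work into analyzers.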

The classic solution to most problems in the Lucene world is to use specially-crafted analyzers at index time and query time (not necessarily the same). In your case, you will want to use this kind of analyzer when indexing:

@AnalyzerDef(name = "edgeNgram",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
            @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class), // Replace accented characters by their simpler counterpart (è => e, etc.)
            @TokenFilterDef(factory = LowerCaseFilterFactory.class), // Lowercase all characters
            @TokenFilterDef(
                    factory = EdgeNGramFilterFactory.class, // Generate prefix tokens
                    params = {
                            @Parameter(name = "minGramSize", value = "1"),
                            @Parameter(name = "maxGramSize", value = "10")
                    }
            )
    })

And this kind when querying:

@AnalyzerDef(name = "edgeNGram_query",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
            @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class), // Replace accented characters by their simpler counterpart (è => e, etc.)
            @TokenFilterDef(factory = LowerCaseFilterFactory.class) // Lowercase all characters
    })

The index analyzer will transform "Mauricio Ubilla Carvajal" to this list of tokens:

  • m
  • ma
  • mau
  • maur
  • mauri
  • mauric
  • maurici
  • mauricio
  • u
  • ub
  • ...
  • ubilla
  • c
  • ca
  • ...
  • carvajal

And the query analyzer will turn the query "mau UB" into ["mau", "ub"], which will match the indexed name (both tokens are present in the index).
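The two analyzer chains can be imitated in plain Java to see why "mau UB" matches. This is only a rough sketch, not the Lucene implementation: the class and method names are made up, accent stripping via java.text.Normalizer is a crude approximation of ASCIIFoldingFilterFactory, and matches() assumes that every query token must be present in the index.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class EdgeNGramSketch {
    static final int MIN_GRAM = 1;  // mirrors minGramSize
    static final int MAX_GRAM = 10; // mirrors maxGramSize

    // Query-time chain: split on whitespace, strip accents, lowercase.
    static List<String> analyzeQuery(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Decompose accented characters, then drop the combining marks.
            String folded = Normalizer.normalize(word, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}", "");
            tokens.add(folded.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Index-time chain: same as above, then emit every prefix of each token
    // from MIN_GRAM up to MAX_GRAM characters.
    static Set<String> analyzeForIndex(String text) {
        Set<String> grams = new LinkedHashSet<>();
        for (String token : analyzeQuery(text)) {
            int longest = Math.min(MAX_GRAM, token.length());
            for (int len = MIN_GRAM; len <= longest; len++) {
                grams.add(token.substring(0, len));
            }
        }
        return grams;
    }

    // Assumption for this sketch: a name matches when all query tokens are indexed.
    static boolean matches(String indexedName, String query) {
        return analyzeForIndex(indexedName).containsAll(analyzeQuery(query));
    }

    public static void main(String[] args) {
        System.out.println(analyzeForIndex("Mauricio Ubilla Carvajal"));
        System.out.println(matches("Mauricio Ubilla Carvajal", "mau UB")); // true
    }
}
```

Note that only prefixes are indexed, so a query such as "icio" would not match in this sketch; that corresponds to the "Bil" / "Bag" use case in the question rather than to arbitrary substring search.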

Note that you'll have to assign the analyzers to the field. For the indexing part, this is done with the @Analyzer annotation. For the query part, you'll have to use overridesForField on the query builder.
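A sketch of how the two assignments might look, assuming Hibernate Search 5.x with JPA. The Patient entity and the PatientSearch helper are hypothetical, and the two @AnalyzerDef declarations from the answer are assumed to be declared on an indexed entity so that the names "edgeNgram" and "edgeNGram_query" resolve.

```java
import javax.persistence.Entity;
import javax.persistence.Id;

import org.apache.lucene.search.Query;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.query.dsl.QueryBuilder;

// Hypothetical entity; the @AnalyzerDef declarations shown earlier are
// assumed to be present on this class or on another indexed entity.
@Entity
@Indexed
class Patient {
    @Id
    Long id;

    @Field
    @Analyzer(definition = "edgeNgram") // index-time analyzer (edge n-grams)
    String fullName;
}

class PatientSearch {
    // Builds a Lucene query for the raw user input, without splitting it
    // into terms and without appending "*".
    static Query byName(FullTextEntityManager ftem, String userInput) {
        QueryBuilder qb = ftem.getSearchFactory()
                .buildQueryBuilder()
                .forEntity(Patient.class)
                // Query-time analyzer: same chain minus the n-gram filter.
                .overridesForField("fullName", "edgeNGram_query")
                .get();
        return qb.keyword()
                .onField("fullName")
                .matching(userInput)
                .createQuery();
    }
}
```

As far as I know, a keyword query combines the analyzed tokens with OR semantics by default. If every token has to match, as in the question, one option is to keep the bool().must(...) junction from the original code and add this keyword query per term, just without the wildcard.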
