Hibernate Search: How to use wildcards correctly?

Question

I have the following query to search patients by full name for a specific medical center:

MustJunction mj = qb.bool().must(qb.keyword()
    .onField("medicalCenter.id")
    .matching(medicalCenter.getId())
    .createQuery());
for(String term: terms)
    if(!term.equals(""))
       mj.must(qb.keyword()
       .onField("fullName")
       .matching(term+"*")
       .createQuery());

And it is working perfectly, but only if the user types the full firstname and/or lastname of the patient.

However, I would like to make it work even if the user types only part of the first name or last name.

For example, if there's a patient called "Bilbo Baggins", I would like the search to find him when the user types "Bilbo Baggins", "Bilbo", "Baggins", or even just "Bil" or "Bag".

To achieve this I modified the above query as follows:

MustJunction mj = qb.bool().must(qb.keyword()
    .onField("medicalCenter.id")
    .matching(medicalCenter.getId())
    .createQuery());
for(String term: terms)
    if(!term.equals(""))
       mj.must(qb.keyword()
       .wildcard()
       .onField("fullName")
       .matching(term+"*")
       .createQuery());

Note how I added the wildcard() function before the call to onField()

However, this breaks the search and returns no results. What am I doing wrong?

Recommended answer

Updated answer for Hibernate Search 6

Short answer: don't use wildcard queries, use a custom analyzer with an EdgeNGramFilterFactory. Also, don't try to analyze the query yourself (that's what you did by splitting the query into terms): Lucene will do it much better (with a WhitespaceTokenizerFactory, an ASCIIFoldingFilterFactory and a LowerCaseFilterFactory in particular).

Long answer:

Wildcard queries are useful as quick and easy solutions to one-time problems, but they are not very flexible and reach their limits quite quickly. In particular, as @femtoRgon mentioned, these queries are not analyzed (at least not completely, and not with every backend), so an uppercase query won't match a lowercase name, for instance.
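
To make the case problem concrete, here is a minimal plain-Java sketch. It does not use Lucene or Hibernate Search: the class name, the token list and the pattern-to-regex translation are all illustrative assumptions, and only the `*` wildcard is handled. It imitates an index whose tokens were lowercased at indexing time and a wildcard pattern that is compared against them without any analysis.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: imitates "wildcard patterns are not analyzed" without any search library.
final class WildcardCaseSketch {

    // Tokens as a lowercasing analyzer would have stored them for "Bilbo Baggins" (assumed).
    static final List<String> INDEXED_TOKENS = Arrays.asList("bilbo", "baggins");

    // Turns a pattern containing '*' into a regex and tests it against the stored tokens.
    // The pattern is used exactly as typed: no lowercasing, no accent folding.
    static boolean wildcardMatches(String pattern) {
        String[] parts = pattern.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        Pattern compiled = Pattern.compile(regex.toString());
        return INDEXED_TOKENS.stream().anyMatch(token -> compiled.matcher(token).matches());
    }
}
```

In this sketch `WildcardCaseSketch.wildcardMatches("Bil*")` is false while `WildcardCaseSketch.wildcardMatches("bil*")` is true, which is one plausible reason a wildcard search returns nothing when the user types a capitalized name.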

The classic solution to most problems in the Lucene/Elasticsearch world is to use specially-crafted analyzers at index time and query time (not necessarily the same). In your case, you will want to use this kind of analyzer (one for indexing, one for searching):

Lucene:

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.WhitespaceTokenizerFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class MyAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        context.analyzer( "autocomplete_indexing" ).custom()
                .tokenizer( WhitespaceTokenizerFactory.class )
                // Lowercase all characters
                .tokenFilter( LowerCaseFilterFactory.class )
                // Replace accented characters by their simpler counterpart (è => e, etc.)
                .tokenFilter( ASCIIFoldingFilterFactory.class )
                // Generate prefix tokens
                .tokenFilter( EdgeNGramFilterFactory.class )
                        .param( "minGramSize", "1" )
                        .param( "maxGramSize", "10" );
        // Same as "autocomplete_indexing", but without the edge-ngram filter
        context.analyzer( "autocomplete_search" ).custom()
                .tokenizer( WhitespaceTokenizerFactory.class )
                // Lowercase all characters
                .tokenFilter( LowerCaseFilterFactory.class )
                // Replace accented characters by their simpler counterpart (è => e, etc.)
                .tokenFilter( ASCIIFoldingFilterFactory.class );
    }
}

Elasticsearch:

import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurationContext;
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurer;

public class MyAnalysisConfigurer implements ElasticsearchAnalysisConfigurer {
    @Override
    public void configure(ElasticsearchAnalysisConfigurationContext context) {
        context.analyzer( "autocomplete_indexing" ).custom()
                .tokenizer( "whitespace" )
                .tokenFilters( "lowercase", "asciifolding", "autocomplete_edge_ngram" );
        // Custom edge-ngram filter referenced by the indexing analyzer above
        context.tokenFilter( "autocomplete_edge_ngram" )
                .type( "edge_ngram" )
                .param( "min_gram", 1 )
                .param( "max_gram", 10 );
        // Same as "autocomplete_indexing", but without the edge-ngram filter
        context.analyzer( "autocomplete_search" ).custom()
                .tokenizer( "whitespace" )
                .tokenFilters( "lowercase", "asciifolding" );
    }
}

The indexing analyzer will transform "Mauricio Ubilla Carvajal" to this list of tokens:

  • m
  • ma
  • mau
  • maur
  • mauri
  • mauric
  • maurici
  • mauricio
  • u
  • ub
  • ...
  • ubilla
  • c
  • ca
  • ...
  • carvajal

And the query analyzer will turn the query "mau UB" into ["mau", "ub"], which will match the indexed name (both tokens are present in the index).
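
The behavior described above can be imitated in plain Java to see why the match succeeds. This is only a sketch of the idea, not what Lucene or Hibernate Search actually executes: the class and method names are hypothetical, the ASCII folding is approximated by stripping combining accents, and the "all query tokens must be indexed" rule is an assumption made for the illustration.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of the two analyzers; not Lucene or Hibernate Search code.
final class EdgeNGramSketch {

    static final int MIN_GRAM = 1;
    static final int MAX_GRAM = 10;

    // Imitates the search analyzer: whitespace tokenizer, lowercase,
    // then a rough ASCII folding that only strips combining accents.
    static List<String> analyzeForSearch(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String lower = raw.toLowerCase(Locale.ROOT);
            String folded = Normalizer.normalize(lower, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}", "");
            tokens.add(folded);
        }
        return tokens;
    }

    // Imitates the indexing analyzer: same chain, then every prefix
    // of length MIN_GRAM..MAX_GRAM of each token.
    static Set<String> analyzeForIndexing(String text) {
        Set<String> grams = new LinkedHashSet<>();
        for (String token : analyzeForSearch(text)) {
            int max = Math.min(MAX_GRAM, token.length());
            for (int len = MIN_GRAM; len <= max; len++) {
                grams.add(token.substring(0, len));
            }
        }
        return grams;
    }

    // True when every analyzed query token is among the indexed tokens.
    static boolean matches(String indexedText, String query) {
        return analyzeForIndexing(indexedText).containsAll(analyzeForSearch(query));
    }
}
```

With these definitions, `EdgeNGramSketch.matches("Mauricio Ubilla Carvajal", "mau UB")` evaluates to true, because both "mau" and "ub" are among the prefixes produced at indexing time, whereas a query such as "xyz" does not match. Note that all the prefix work happens on the indexing side; the query is only split, lowercased and folded.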

Note that you'll obviously have to assign the analyzers to the field. In Hibernate Search 6 it's easy, as you can assign a searchAnalyzer to a field, separately from the indexing analyzer:

@FullTextField(analyzer = "autocomplete_indexing", searchAnalyzer = "autocomplete_search")

Then you can easily search with, say, a simpleQueryString predicate:

List<Patient> hits = searchSession.search( Patient.class )
        .where( f -> f.simpleQueryString().field( "fullName" )
                .matching( "mau + UB" ) )
        .fetchHits( 20 );

Or if you don't need extra syntax and operators, a match predicate should do:

List<Patient> hits = searchSession.search( Patient.class )
        .where( f -> f.match().field( "fullName" )
                .matching( "mau UB" ) )
        .fetchHits( 20 );


Original answer for Hibernate Search 5

Short answer: don't use wildcard queries, use a custom analyzer with an EdgeNGramFilterFactory. Also, don't try to analyze the query yourself (that's what you did by splitting the query into terms): Lucene will do it much better (with a WhitespaceTokenizerFactory, an ASCIIFoldingFilterFactory and a LowerCaseFilterFactory in particular).

Long answer:

Wildcard queries are useful as quick and easy solutions to one-time problems, but they are not very flexible and reach their limits quite quickly. In particular, as @femtoRgon mentioned, these queries are not analyzed, so an uppercase query won't match a lowercase name, for instance.

The classic solution to most problems in the Lucene world is to use specially-crafted analyzers at index time and query time (not necessarily the same). In your case, you will want to use this kind of analyzer when indexing:

@AnalyzerDef(name = "edgeNgram",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
            @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class), // Replace accented characters by their simpler counterpart (è => e, etc.)
            @TokenFilterDef(factory = LowerCaseFilterFactory.class), // Lowercase all characters
            @TokenFilterDef(
                    factory = EdgeNGramFilterFactory.class, // Generate prefix tokens
                    params = {
                            @Parameter(name = "minGramSize", value = "1"),
                            @Parameter(name = "maxGramSize", value = "10")
                    }
            )
    })

And this kind when querying:

@AnalyzerDef(name = "edgeNGram_query",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
            @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class), // Replace accented characters by their simpler counterpart (è => e, etc.)
            @TokenFilterDef(factory = LowerCaseFilterFactory.class) // Lowercase all characters
    })

The index analyzer will transform "Mauricio Ubilla Carvajal" to this list of tokens:

  • m
  • ma
  • mau
  • maur
  • mauri
  • mauric
  • maurici
  • mauricio
  • u
  • ub
  • ...
  • ubilla
  • c
  • ca
  • ...
  • carvajal

And the query analyzer will turn the query "mau UB" into ["mau", "ub"], which will match the indexed name (both tokens are present in the index).

Note that you'll obviously have to assign the analyzer to the field. For the indexing part, it's done using the @Analyzer annotation. For the query part, you'll have to use overridesForField on the query builder as shown here:

QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Hospital.class)
    .overridesForField( "name", "edgeNGram_query" )
    .get();
// Then it's business as usual

Also note that, in Hibernate Search 5, Elasticsearch analyzer definitions are only generated by Hibernate Search if they are actually assigned to an index. So the query analyzer definition will not, by default, be generated, and Elasticsearch will complain that it does not know the analyzer. Here is a workaround: https://discourse.hibernate.org/t/cannot-find-the-overridden-analyzer-when-using-overridesforfield/1043/4?u=yrodiere
