Hibernate Search | ngram analyzer with minGramSize 1


Question

I have some problems with my Hibernate Search analyzer configuration. One of my indexed entities ("Hospital") has a String field ("name") that can contain values 1 to 40 characters long. I want to be able to find an entity by searching for just one character, because a hospital could have a single-character name.

@Indexed(index = "HospitalIndex")
@AnalyzerDef(name = "ngram",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = StandardFilterFactory.class),
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                @TokenFilterDef(factory = NGramFilterFactory.class,
                        params = {
                                @Parameter(name = "minGramSize", value = "1"),
                                @Parameter(name = "maxGramSize", value = "40")})
        })
public class Hospital {

        @Field(index = Index.YES, analyze = Analyze.YES, store = Store.NO, analyzer = @Analyzer(definition = "ngram"))
        private String name = "";
}

If I add a hospital with the name "My Test Hospital", the Lucene index looks like this:

1   name    al
1   name    e
1   name    es
1   name    est
1   name    h
1   name    ho
1   name    hos
1   name    hosp
1   name    hospi
1   name    hospit
1   name    hospita
1   name    hospital
1   name    i
1   name    it
1   name    ita
1   name    ital
1   name    l
1   name    m
1   name    my
1   name    o
1   name    os
1   name    osp
1   name    ospi
1   name    ospit
1   name    ospita
1   name    ospital
1   name    p
1   name    pi
1   name    pit
1   name    pita
1   name    pital
1   name    s
1   name    sp
1   name    spi
1   name    spit
1   name    spita
1   name    spital
1   name    st
1   name    t
1   name    ta
1   name    tal
1   name    te
1   name    tes
1   name    test
1   name    y
1   name    a

This is how I build and execute my search query:

QueryBuilder hospitalQb = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Hospital.class).get();
Query hospitalQuery = hospitalQb.keyword().onFields("name").matching(searchString).createQuery();
javax.persistence.Query persistenceQuery = fullTextEntityManager.createFullTextQuery(hospitalQuery, Hospital.class);
List<Hospital> results = persistenceQuery.getResultList();  

The problem is that the same ngram analyzer is also applied to my search query. So when I search for "hospital", for example, I find every hospital whose name contains an "a" character. This is what the search query looks like when I call toString() on it:

name:h name:ho name:hos name:hosp name:hospi name:hospit name:hospita name:hospital name:o name:os name:osp name:ospi name:ospit name:ospita name:ospital name:s name:sp name:spi name:spit name:spita name:spital name:p name:pi name:pit name:pita name:pital name:i name:it name:ita name:ital name:t name:ta name:tal name:a name:al name:l
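To see why that query is so broad, the expansion can be reproduced outside Hibernate Search. The sketch below is plain Java, not the Lucene filter itself: the class and method names are made up for illustration, and it assumes the filter emits every substring of a token whose length lies between minGramSize and maxGramSize.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hibernate Search or Lucene.
class NGramQueryDemo {
    // Emits every substring of `token` whose length is between min and max,
    // i.e. the set of grams an ngram filter with these bounds would produce.
    static List<String> ngrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = min; len <= max && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Same bounds as the analyzer definition: minGramSize 1, maxGramSize 40.
        List<String> queryTerms = ngrams("hospital", 1, 40);
        System.out.println(queryTerms.size() + " terms: " + queryTerms);
        // The single-character gram "a" is one of the query terms, so any
        // indexed name containing an "a" satisfies the OR-combined query.
        System.out.println("contains \"a\": " + queryTerms.contains("a"));
    }
}
```

For "hospital" this yields the same 36 terms as the toString() output above, including the single characters that make almost every document match.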

So the question is: does anybody know a better analyzer configuration, or another way to build the search query, that solves the problem?

Recommended answer

You can set up a second analyzer, identical except that it does not have an ngram filter, and then override the analyzer used for queries:

QueryBuilder hospitalQb = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Hospital.class)
    .overridesForField( "name", "my_analyzer_without_ngrams" )
    .get();
// Then it's business as usual
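The effect of the override can be illustrated with a small plain-Java simulation. It does not use the Hibernate Search API: the class and method names are hypothetical, and whitespace splitting stands in for the standard tokenizer as a simplification. Names are indexed with ngrams (1 to 40), while the search string is only lowercased and split, which is roughly what an analyzer such as "my_analyzer_without_ngrams" would do.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation, not the Hibernate Search or Lucene implementation.
class QueryAnalyzerDemo {
    // Index-time analysis: lowercase, split on whitespace, emit all ngrams of length 1..40.
    static Set<String> indexTerms(String name) {
        Set<String> terms = new HashSet<>();
        for (String token : name.toLowerCase().split("\\s+")) {
            for (int start = 0; start < token.length(); start++) {
                int maxEnd = Math.min(token.length(), start + 40);
                for (int end = start + 1; end <= maxEnd; end++) {
                    terms.add(token.substring(start, end));
                }
            }
        }
        return terms;
    }

    // Query-time analysis without the ngram filter: lowercase and split only.
    static List<String> queryTerms(String search) {
        return Arrays.asList(search.toLowerCase().split("\\s+"));
    }

    // A name matches if any query term is among its indexed terms.
    static boolean matches(String name, String search) {
        Set<String> indexed = indexTerms(name);
        return queryTerms(search).stream().anyMatch(indexed::contains);
    }

    public static void main(String[] args) {
        // The whole token "hospital" is one of the indexed grams.
        System.out.println(matches("My Test Hospital", "hospital"));
        // This name contains an "a", but "hospital" is no longer split into grams.
        System.out.println(matches("Santa Clara Clinic", "hospital"));
        // Single-character names remain findable by a single character.
        System.out.println(matches("A", "a"));
    }
}
```

In-word searches keep working under this scheme: a search for "spit" still finds "My Test Hospital", because the partial term is matched against the ngrams stored at index time rather than being expanded at query time.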


Additionally, if you are implementing some kind of auto-completion (foo*), and not in-word search (*foo*), you may want to use EdgeNGramFilterFactory instead of NGramFilterFactory: it will only generate ngrams that are prefixes of the indexed tokens.
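The difference between the two filters can be sketched in plain Java. This is an illustration under the assumption that the edge variant keeps only grams anchored at the start of the token; the class and method names are invented and do not come from Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hibernate Search or Lucene.
class EdgeNGramDemo {
    // All grams: every substring with a length between min and max.
    static List<String> ngrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = min; len <= max && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    // Edge grams: only the prefixes with a length between min and max.
    static List<String> edgeNgrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int len = min; len <= Math.min(max, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println("ngram:      " + ngrams("test", 1, 40));
        System.out.println("edge ngram: " + edgeNgrams("test", 1, 40));
    }
}
```

With edge grams, "test" is indexed only as its prefixes, so a search for "te" matches but a search for "es" does not. That also keeps the index considerably smaller than the full ngram listing shown in the question.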
