Spring Data Elasticsearch wildcard search


Problem description

I am trying to search for the word blue in the below list of text:

"BlueSaphire","Bluo","alue","blue","BLUE",蓝",蓝黑",蓝",蓝宝石蓝",黑色",绿色","bloo","Saphireblue"

"BlueSaphire","Bluo","alue","blue", "BLUE", "Blue","Blue Black","Bluo","Saphire Blue", "black" , "green","bloo" , "Saphireblue"

SearchQuery searchQuery = new NativeSearchQueryBuilder()
        .withIndices("color")
        .withQuery(matchQuery("colorDescriptionCode", "blue")
                .fuzziness(Fuzziness.ONE))
        .build();

This works fine, and the search result returns the below records along with their scores:

alue    2.8718023
Bluo    1.7804208
Bluo    1.7804208
BLUE    1.2270637
blue    1.2270637
Blue    1.2270637
Blue Black    1.1082436
Saphire Blue    0.7669148

But I am not able to make the wildcard work. "SaphireBlue" and "BlueSaphire" are also expected to be part of the result.

I tried the below setting, but it does not work:

SearchQuery searchQuery = new NativeSearchQueryBuilder()
        .withIndices("color")
        .withQuery(matchQuery("colorDescriptionCode", "(.*?)blue")
                .fuzziness(Fuzziness.ONE))
        .build();

On Stack Overflow, I observed a solution that specifies analyzeWildcard:

QueryBuilder queryBuilder = boolQuery().should(
        queryString("blue").analyzeWildcard(true)
                .field("colorDescriptionCode", 2.0f));

I don't find the queryString static method. I am using spring-data-elasticsearch 2.0.0.RELEASE.

Let me know how I can specify the wildcard so that all words containing blue will also be returned in the search results.

Recommended answer

I know that working examples are always better than theory, but still, I would first like to tell a little theory. The heart of Elasticsearch is Lucene. So before a document is written to the Lucene index, it goes through the analysis stage. The analysis stage can be divided into three parts:

  1. Character filtering
  2. Tokenization
  3. Token filtering

In the first stage, we can throw away unwanted characters, for example, HTML tags. More information about character filters can be found on the official site. The next stage is far more interesting. Here we split the input text into tokens, which will be used later for searching. A few very useful tokenizers:

  • standard tokenizer. It's used by default. The tokenizer implements the Unicode Text Segmentation algorithm. In practice, you can use this to split text into words and use these words as tokens.
  • n-gram tokenizer. This is what you need if you want to search by part of a word. This tokenizer splits text into a contiguous sequence of n items. For example, the text "for example" will be split into this sequence of tokens: "fo", "or", "r ", " e", "ex", "for", "or ex", etc. The length of an n-gram is variable and can be configured by the min_gram and max_gram params.
  • edge n-gram tokenizer. Works the same as the n-gram tokenizer except for one thing: this tokenizer only builds n-grams anchored to the beginning of the token. For example, the text "for example" will be split into this sequence of tokens: "fo", "for", "for ", "for e", "for ex", "for exa", etc. More information about tokenizers can be found on the official site. A quick way to inspect what a tokenizer produces is shown in the sketch after this list.
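To make the tokenizers concrete, here is a minimal sketch, assuming an Elasticsearch 2.x transport Client instance named client (the variable is an assumption of this sketch, not part of the original answer). It runs the built-in ngram tokenizer over a sample text through the _analyze API and prints the resulting tokens:

import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;
import org.elasticsearch.client.Client;

// Run the built-in "ngram" tokenizer over a sample text via the _analyze API
AnalyzeResponse response = client.admin().indices()
        .prepareAnalyze("for example")
        .setTokenizer("ngram")
        .get();

// Print each token the tokenizer produced
for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
    System.out.println(token.getTerm());
}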

The next stage is also damn interesting. After we split the text into tokens, we can do a lot of interesting things with them. Again, I give a few very useful examples of token filters:

  • lowercase filter. In most cases we want case-insensitive search, so it's good practice to bring tokens to lowercase.
  • stemmer filter. When we deal with natural language, we have a lot of problems. One of them is that one word can have many forms. The stemmer filter helps us get the root form of a word.
  • fuzziness filter. Another problem is that users often make typos. This filter adds tokens that contain possible typos. Chaining a tokenizer with token filters is shown in the sketch after this list.
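The same _analyze call can chain a tokenizer with token filters. A hedged sketch, again assuming the Elasticsearch 2.x Java client named client, and using the built-in porter_stem filter as a stand-in for a stemmer:

// Tokenize with the standard tokenizer, then lowercase and stem each token
AnalyzeResponse response = client.admin().indices()
        .prepareAnalyze("Running Blues")
        .setTokenizer("standard")
        .setTokenFilters("lowercase", "porter_stem")
        .get();

for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
    System.out.println(token.getTerm()); // expected output along the lines of "run", "blue"
}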

If you are interested in looking at the result of the analyzer, you can use the _termvectors endpoint:

curl [ELASTIC_URL]:9200/[INDEX_NAME]/[TYPE_NAME]/[DOCUMENT_ID]/_termvectors?pretty

Now let's talk about queries. Queries are divided into two large groups. These groups have two significant differences:

  1. Whether the request goes through the analysis stage;
  2. Whether the request requires an exact answer (yes or no)

Examples are the match query and the term query. The first goes through the analysis stage, the second does not. The first will not give us a specific answer (but gives us a score); the second will. When creating mappings for a document, we can specify both the index analyzer and the search analyzer separately per field.
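A small sketch of the contrast, using QueryBuilders from the Elasticsearch 2.x Java API (the field name is illustrative):

import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// matchQuery analyzes its input, so "Blue" can still match the lowercased token "blue"
QueryBuilder analyzed = QueryBuilders.matchQuery("title", "Blue");

// termQuery bypasses analysis and looks up the literal term "Blue" in the index
QueryBuilder exact = QueryBuilders.termQuery("title", "Blue");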

Now for information regarding Spring Data Elasticsearch. Here it makes sense to talk about concrete examples. Suppose that we have a document with a title field, and we want to search for information in this field. First, create a file with the settings for Elasticsearch:

{
  "analysis": {
    "analyzer": {
      "ngram_analyzer": {
        "tokenizer": "ngram_tokenizer",
        "filter": [
          "lowercase"
        ]
      },
      "edge_ngram_analyzer": {
        "tokenizer": "edge_ngram_tokenizer",
        "filter": [
          "lowercase"
        ]
      },
      "english_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "english_stop",
          "unique",
          "english_possessive_stemmer",
          "english_stemmer"
        ]
      },
      "keyword_analyzer": {
        "tokenizer": "keyword",
        "filter": ["lowercase"]
      }
    },
    "tokenizer": {
      "ngram_tokenizer": {
        "type": "ngram",
        "min_gram": 2,
        "max_gram": 20
      },
      "edge_ngram_tokenizer": {
        "type": "edge_ngram",
        "min_gram": 2,
        "max_gram": 20
      }
    },
    "filter": {
      "english_stop": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "english_stemmer": {
        "type": "stemmer",
        "language": "english"
      },
      "english_possessive_stemmer": {
        "type": "stemmer",
        "language": "possessive_english"
      }
    }
  }
}

You can save these settings to your resource folder. Now let's look at our document class:

@Document(indexName = "document", type = "document")
@Setting(settingPath = "document_index_setting.json")
public class Document {

    @Id
    private String id;

    @MultiField(
        mainField = @Field(type = FieldType.String,
                           index = FieldIndex.not_analyzed),
        otherFields = {
                @InnerField(suffix = "edge_ngram",
                        type = FieldType.String,
                        indexAnalyzer = "edge_ngram_analyzer",
                        searchAnalyzer = "keyword_analyzer"),
                @InnerField(suffix = "ngram",
                        type = FieldType.String,
                        indexAnalyzer = "ngram_analyzer",
                        searchAnalyzer = "keyword_analyzer"),
                @InnerField(suffix = "english",
                        type = FieldType.String,
                        indexAnalyzer = "english_analyzer")
        }
    )
    private String title;

    // getters and setters omitted

}

So here the field title has three inner fields:

  • title.edge_ngram for searching by edge n-grams with the keyword search analyzer. We need this because we don't want our query to be split into edge n-grams;
  • title.ngram for searching by n-grams;
  • title.english for searching with the nuances of a natural language.

And the main field title, which we don't analyze because sometimes we want to sort by it. Let's use a simple multi match query to search through all these fields:
String searchQuery = "blablabla";
MultiMatchQueryBuilder queryBuilder = multiMatchQuery(searchQuery)
    .field("title.edge_ngram", 2)
    .field("title.ngram")
    .field("title.english");
NativeSearchQueryBuilder searchBuilder = new NativeSearchQueryBuilder()
    .withIndices("document")
    .withTypes("document")
    .withQuery(queryBuilder)
    .withPageable(new PageRequest(page, pageSize));
elasticsearchTemplate.queryForPage(searchBuilder.build(),
                                   Document.class,
                                   new SearchResultMapper() {
                                       // realisation omitted
                                   });
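The mapper realisation is omitted in the original answer. As a sketch only, a minimal implementation against the spring-data-elasticsearch 2.x SearchResultMapper interface might look like the following (the FacetedPageImpl return value and the unchecked cast are assumptions of this sketch):

import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.SearchHit;
import org.springframework.data.domain.Pageable;
import org.springframework.data.elasticsearch.core.FacetedPage;
import org.springframework.data.elasticsearch.core.FacetedPageImpl;
import org.springframework.data.elasticsearch.core.SearchResultMapper;

public class DocumentResultMapper implements SearchResultMapper {

    @Override
    @SuppressWarnings("unchecked")
    public <T> FacetedPage<T> mapResults(SearchResponse response, Class<T> clazz, Pageable pageable) {
        List<Document> documents = new ArrayList<>();
        // Copy the id and the title of each hit into our Document entity
        for (SearchHit hit : response.getHits()) {
            Document document = new Document();
            document.setId(hit.getId());
            document.setTitle((String) hit.getSource().get("title"));
            documents.add(document);
        }
        return new FacetedPageImpl<>((List<T>) documents);
    }
}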

Search is a very interesting and voluminous topic. I tried to answer as briefly as possible; it is possible that because of this there are confusing moments - do not hesitate to ask.
