How does the edge ngram token filter differ from the ngram token filter?


Question

As I am new to Elasticsearch, I cannot tell the difference between the ngram token filter and the edge ngram token filter.


How do the two differ in the way they process tokens?

Recommended answer

I think the documentation is quite clear on this:


This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token.

And the best example for the nGram tokenizer again comes from the documentation (https://www.elastic.co/guide/en/elasticsearch/reference/1.6/analysis-ngram-tokenizer.html):

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'


    # FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04

With this tokenizer definition:

                    "type" : "nGram",
                    "min_gram" : "2",
                    "max_gram" : "3",
                    "token_chars": [ "letter", "digit" ]

In short:


  • Depending on the configuration, the tokenizer first splits the text into tokens. In this example: FC, Schalke, 04.
  • nGram generates groups of characters of minimum min_gram size and maximum max_gram size from the input text. Basically, the tokens are split into small chunks and each chunk is anchored on a character (it doesn't matter where this character is; every position in the token starts its own chunks).
  • edgeNGram does the same, but the chunks always start from the beginning of each token. Basically, the chunks are anchored at the beginning of the tokens.

For the same text as above, an edgeNGram tokenizer generates this: FC, Sc, Sch, Scha, Schal, 04. Note that this output implies a max_gram of 5 rather than the 3 used in the definition above; with max_gram set to 3 the list would be FC, Sc, Sch, 04. Every "word" in the text is considered, and for each "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).
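The edge variant can be sketched the same way. Again, this is only an illustration in Python, not Elasticsearch code: `edge_ngrams` is a hypothetical helper, the regular expression approximates `token_chars: ["letter", "digit"]`, and max_gram is assumed to be 5 to match the output listed above.

```python
import re


def edge_ngrams(text, min_gram, max_gram):
    grams = []
    # Assumption: runs of letters/digits stand in for the real tokenizer.
    for token in re.findall(r"[^\W_]+", text):
        # Chunks are anchored at the first character of each token only.
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            grams.append(token[:size])
    return grams


print(", ".join(edge_ngrams("FC Schalke 04", 2, 5)))
# FC, Sc, Sch, Scha, Schal, 04
print(", ".join(edge_ngrams("FC Schalke 04", 2, 3)))
# FC, Sc, Sch, 04
```

The only difference from a plain n-gram generator is that the start position is fixed at 0, which is why edge n-grams are a common choice for prefix matching such as search-as-you-type.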
