Create an analyzer with an Edge NGram tokenizer and a char filter that replaces spaces with new lines

Views: 90 Published: 2021/5/3 20:32:52 Tags: java elasticsearch lucene tokenize elasticsearch-analyzers


Problem description

I have text of the following kind coming in: foo bar, hello world, etc. I created an analyzer using the Edge NGram tokenizer, and with the analyze API it produces the tokens below.

{
  "tokens": [
    {
      "token": "f",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 1
    },
    {
      "token": "fo",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "foo",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "b",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 4
    },
    {
      "token": "ba",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 5
    },
    {
      "token": "bar",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 6
    }
  ]
}

But when my code passes the text "foo bar" to the tokenStream method, it creates the following tokens for foo bar:

f, fo, foo, foo , foo b, foo ba, foo bar.

These tokens do not match the ones returned by the analyze API. I want to know how to add a char filter that removes the spaces in the text, so that the Edge NGram tokenizer is applied to the individual terms in the text.

So, in the foo bar example, calling the tokenStream method should create the tokens below:

f, fo, foo, b, ba, bar.
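The difference between the two token lists can be reproduced without any search library. The following is a minimal plain-Java sketch, not Lucene or Elasticsearch code: the class `EdgeNGramSketch` and its methods are hypothetical names introduced only for illustration. It assumes the observed behaviour comes from building edge n-grams over the whole input string, while the desired behaviour splits on whitespace first and then builds edge n-grams per word. In Lucene terms the second variant would roughly correspond to a whitespace tokenizer followed by an edge n-gram token filter, but that mapping is an assumption to check against the Lucene version in use.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; it does not use Lucene or Elasticsearch classes.
class EdgeNGramSketch {

    // Front edge n-grams of one string: its prefixes of length minGram..maxGram.
    static List<String> edgeNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, text.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(text.substring(0, len));
        }
        return grams;
    }

    // Split on whitespace first, then build edge n-grams for each word.
    static List<String> edgeNGramsPerWord(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                grams.addAll(edgeNGrams(word, minGram, maxGram));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Whole input treated as one string: the space ends up inside the grams.
        System.out.println(edgeNGrams("foo bar", 1, 30));
        // prints [f, fo, foo, foo , foo b, foo ba, foo bar]

        // Per-word grams: matches the token list returned by the analyze API.
        System.out.println(edgeNGramsPerWord("foo bar", 1, 30));
        // prints [f, fo, foo, b, ba, bar]
    }
}
```

The sketch only mirrors the token text; offsets and positions, which the analyze API also reports, are left out.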

I tried adding the char filter to the Java code that creates the analyzer. The code is below.

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    NormalizeCharMap normalizeCharMap = new NormalizeCharMap();
    normalizeCharMap.add(" ", "\\u2424");
    Reader replaceDots = new MappingCharFilter(normalizeCharMap, reader);
    TokenStream result = new EdgeNGramTokenizer(replaceDots, EdgeNGramTokenizer.DEFAULT_SIDE, 1, 30);
    return result;
}

But it takes \u2424 literally, as is. Also, please let me know whether my Analyzer code is correct.
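One likely reason for the literal output is the string escape itself: in Java source, "\\u2424" is an escaped backslash followed by the text u2424 (six characters), whereas "\u2424" is the single character U+2424 (SYMBOL FOR NEWLINE), which is still not an actual line feed ("\n"). The plain-Java sketch below only illustrates that distinction. The class `EscapeSketch` and its `replaceSpaces` helper are hypothetical names, and nothing here uses the Lucene classes from the code above. Whether swapping the replacement character alone is enough to split the tokens is a separate question, since the tokenizer may still treat the whole input as one string.

```java
// Plain-Java illustration of string escapes; no Lucene classes are involved.
class EscapeSketch {

    // Replaces every space in the text with the given replacement string.
    static String replaceSpaces(String text, String replacement) {
        return text.replace(" ", replacement);
    }

    public static void main(String[] args) {
        String escapedBackslash = "\\u2424"; // six characters: a backslash, then u2424
        String symbolForNewline = "\u2424";  // one character: U+2424, a visible symbol
        String lineFeed = "\n";              // one character: U+000A, a real line break

        System.out.println(escapedBackslash.length()); // prints 6
        System.out.println(symbolForNewline.length()); // prints 1
        System.out.println(lineFeed.length());         // prints 1

        // With the escaped backslash, the six literal characters land in the text.
        System.out.println(replaceSpaces("foo bar", escapedBackslash).length()); // prints 12

        // With a real line feed, the text becomes two lines of three characters.
        System.out.println(replaceSpaces("foo bar", lineFeed).length()); // prints 7
    }
}
```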
