Create an analyzer with an Edge N Gram analyzer and a char filter which replaces spaces with new lines


Question

I have text of the following kind coming in: foo bar, hello world, etc. I created an analyzer using the Edge NGram tokenizer, and when I run it through the analyze API it produces the tokens below.

{
  "tokens": [
    {
      "token": "f",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 1
    },
    {
      "token": "fo",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "foo",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "b",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 4
    },
    {
      "token": "ba",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 5
    },
    {
      "token": "bar",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 6
    }
  ]
}

But when I pass the text "foo bar" to the tokenStream method in my code, it creates the following tokens for foo bar:

f, fo, foo, foo , foo b, foo ba, foo bar.

This causes a mismatch with the tokens returned by the analyze API. I want to know how I can add a char filter that removes the spaces in the text, so that the Edge NGram tokenizer is applied to the individual terms in the text.

So, for the foo bar example, calling the tokenStream method should create the following tokens:

f, fo, foo, b, ba, bar.

I tried adding the char filter to the Java code that creates the analyzer. The code is below.

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    NormalizeCharMap normalizeCharMap = new NormalizeCharMap();
    normalizeCharMap.add(" ", "\\u2424");
    Reader replaceDots = new MappingCharFilter(normalizeCharMap, reader);
    TokenStream result = new EdgeNGramTokenizer(replaceDots, EdgeNGramTokenizer.DEFAULT_SIDE, 1, 30);
    return result;
}

But it takes \u2424 literally, as it is. Also, please let me know whether my Analyzer code is correct or not.

Recommended answer

What you tested using the analyze API is an edge-ngram token filter, which is different from an edge-ngram tokenizer.

If you want your code to behave the same way as what you tested with the analyze API, you need to replace EdgeNGramTokenizer with EdgeNGramTokenFilter.
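
To make the difference concrete, here is a minimal plain-Java sketch that mimics the two behaviours without using Lucene at all. The class and method names (EdgeNGramSketch, asTokenizer, asTokenFilter) are made up for illustration, and splitting on whitespace is an assumption about how the text is broken into words before the filter runs. In real code the equivalent would be a word-splitting tokenizer wrapped by EdgeNGramTokenFilter, whose constructor arguments depend on the Lucene version.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical helper class, not part of Lucene.
public class EdgeNGramSketch {

    // Edge n-grams of one string: its prefixes of length minGram..maxGram.
    public static List<String> edgeNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, text.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(text.substring(0, len));
        }
        return grams;
    }

    // Tokenizer-like behaviour: the whole input is treated as one string,
    // so prefixes run straight across the space.
    public static List<String> asTokenizer(String input, int minGram, int maxGram) {
        return edgeNGrams(input, minGram, maxGram);
    }

    // Token-filter-like behaviour: the input is split into words first
    // (whitespace splitting is an assumption), then each word is n-grammed.
    public static List<String> asTokenFilter(String input, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                grams.addAll(edgeNGrams(word, minGram, maxGram));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [f, fo, foo, foo , foo b, foo ba, foo bar]
        System.out.println(asTokenizer("foo bar", 1, 30));
        // prints [f, fo, foo, b, ba, bar]
        System.out.println(asTokenFilter("foo bar", 1, 30));
    }
}
```

The first list matches what tokenStream returned in the question, and the second matches the analyze API output, which is why swapping the tokenizer for the token filter lines the two up.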
