How to modify standard analyzer to include #?


Problem description

Some characters, such as #, are treated as delimiters, so they would never match in a query. What would be the custom analyzer configuration, closest to the standard analyzer, that allows these characters to be matched?

Answer

1) The simplest way would be to use the whitespace tokenizer with a lowercase filter:

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase&pretty' -d 'new year #celebration vegas'

This would give you:

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "#celebration",
    "start_offset" : 9,
    "end_offset" : 21,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "word",
    "position" : 4
  } ]
}
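Note that the query-string form of the `_analyze` API shown above is from older Elasticsearch versions; in Elasticsearch 5.x and later it was replaced by a JSON request body. An equivalent modern request would look like this (a sketch, assuming ES 5+):

```
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": "new year #celebration vegas"
}
```

The tokens produced are the same: the whitespace tokenizer splits only on whitespace, so `#celebration` survives as a single token.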

2) If you only want to preserve some special characters, you can map them with a char filter, so that your text is transformed into something else before tokenization takes place. This is closer to the standard analyzer. For example, you can create your index like this:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "special_analyzer": {
          "char_filter": [
            "special_mapping"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "special_mapping": {
          "type": "mapping",
          "mappings": [
            "#=>hashtag\\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "special_analyzer"
        }
      }
    }
  }
}
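The mapping above uses the pre-5.x `string` field type and a named mapping type (`my_type`), both of which were later removed. On Elasticsearch 7.x and later, the same index would be created roughly like this (a sketch for newer versions; the analysis settings are unchanged):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "special_analyzer": {
          "char_filter": ["special_mapping"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      },
      "char_filter": {
        "special_mapping": {
          "type": "mapping",
          "mappings": ["#=>hashtag\\u0020"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "tweet": {
        "type": "text",
        "analyzer": "special_analyzer"
      }
    }
  }
}
```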

Now, for

curl -XPOST 'localhost:9200/my_index/_analyze?analyzer=special_analyzer&pretty' -d 'new year #celebration vegas'

the custom analyzer will generate the following tokens:

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "hashtag",
    "start_offset" : 9,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "celebration",
    "start_offset" : 10,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 5
  } ]
}

So you can search like this:

GET my_index/_search
{
  "query": {
    "match": {
      "tweet": "#celebration"
    }
  }
}

You will also be able to search for just celebration, because the mapping uses the Unicode escape for a space (\\u0020) after hashtag; otherwise we would always have to search with the # prefix.
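For example, since the char filter rewrites # into a separate hashtag token followed by a space, a query without the # matches the same documents:

```
GET my_index/_search
{
  "query": {
    "match": {
      "tweet": "celebration"
    }
  }
}
```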

Hope this helps!

