Fields not getting sorted in alphabetical order in Elasticsearch


Problem description

I have a few documents, each with a name field. I use the analyzed version of the name field for search and the not_analyzed version for sorting. Sorting works at the first level: the names are grouped alphabetically by their first letter. But within each letter's group, the names are sorted lexicographically rather than alphabetically. Here is the mapping I have used:

{
  "mappings": {
    "seing": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Can anyone suggest a solution for this?

Solution

Digging into the Elasticsearch documentation, I came across this:

Case-Insensitive Sorting

Imagine that we have three user documents whose name fields contain Boffey, BROWN, and bailey, respectively. First we will apply the technique described in String Sorting and Multifields of using a not_analyzed field for sorting:

PUT /my_index
{
  "mappings": {
    "user": {
      "properties": {
        "name": {                    //1
          "type": "string",
          "fields": {
            "raw": {                 //2
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

  1. The analyzed name field is used for search.
  2. The not_analyzed name.raw field is used for sorting.

The preceding search request would return the documents in this order: BROWN, Boffey, bailey. This is known as lexicographical order as opposed to alphabetical order. Essentially, the bytes used to represent capital letters have a lower value than the bytes used to represent lowercase letters, and so the names are sorted with the lowest bytes first.
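
The byte-ordering effect described above can be reproduced outside Elasticsearch. The following is a minimal Python sketch, not Elasticsearch code: it assumes plain ASCII names, for which Python's code-point comparison matches byte-wise comparison, and the variable names are illustrative only.

```python
# Plain-Python illustration of byte-wise (lexicographical) ordering.
# This only mimics how a not_analyzed field compares ASCII values.
names = ["Boffey", "BROWN", "bailey"]

# Uppercase ASCII letters have lower code points than lowercase ones.
code_points = {ch: ord(ch) for ch in "BRob"}

# Python compares strings by code point, so case affects the order.
raw_order = sorted(names)

print(code_points)  # {'B': 66, 'R': 82, 'o': 111, 'b': 98}
print(raw_order)    # ['BROWN', 'Boffey', 'bailey']
```

BROWN sorts before Boffey because `R` (82) is lower than `o` (111), and bailey comes last because `b` (98) is higher than `B` (66).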

That may make sense to a computer, but doesn’t make much sense to human beings who would reasonably expect these names to be sorted alphabetically, regardless of case. To achieve this, we need to index each name in a way that the byte ordering corresponds to the sort order that we want.

In other words, we need an analyzer that will emit a single lowercase token:

Following this logic, instead of indexing the raw value unchanged, you need to lowercase it using a custom analyzer built on the keyword tokenizer:

PUT /my_index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "case_insensitive_sort" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase"]
        }
      }
    }
  },
  "mappings" : {
    "seing" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "fields" : {
            "raw" : {
              "type" : "string",
              "analyzer" : "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}

Now ordering by name.raw should sort in alphabetical order, rather than lexicographical.
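
To make the effect of the case_insensitive_sort analyzer concrete, here is a rough Python sketch of the same idea. The helper functions are hypothetical stand-ins for the keyword tokenizer and lowercase filter, not Elasticsearch APIs, and Python's `str.lower()` is only assumed to behave like the lowercase filter for simple ASCII names.

```python
def keyword_tokenizer(text):
    """Hypothetical stand-in for the keyword tokenizer: the whole input is one token."""
    return [text]


def lowercase_filter(tokens):
    """Hypothetical stand-in for the lowercase token filter."""
    return [token.lower() for token in tokens]


def case_insensitive_sort(text):
    """Chain tokenizer and filter, mirroring the analyzer definition."""
    return lowercase_filter(keyword_tokenizer(text))


names = ["Boffey", "BROWN", "bailey"]

# Sort by the single emitted token; the original values stay unchanged.
sorted_names = sorted(names, key=lambda name: case_insensitive_sort(name)[0])

print(case_insensitive_sort("BROWN"))  # ['brown']
print(sorted_names)                    # ['bailey', 'Boffey', 'BROWN']
```

The stored value is untouched; only the token used for comparison is lowercased. This is why a sort on such a field reports lowercase sort values while `_source` still shows the original casing.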

Here is a quick test I ran on my local machine using Marvel:

Index structure:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_sort": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            },
            "keyword": {
              "type": "string",
              "analyzer": "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}

Test data:

PUT /my_index/user/1
{
  "name": "Tim"
}

PUT /my_index/user/2
{
  "name": "TOM"
}

Query using raw field:

POST /my_index/user/_search
{
  "sort": "name.raw"
}

Result:

{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "TOM"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "Tim"
  ]
}

Query using lowercased string:

POST /my_index/user/_search
{
  "sort": "name.keyword"
}

Result:

{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "tim"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "tom"
  ]
}

I suspect the second result is the one you want in your case.
