not_analyzed field with doc_values still in fielddata cache


Problem description


While experimenting with fielddata vs doc_values, I ran into a strange case. My earlier mapping did not use doc values at all. In my new mapping, I added doc_values: true to every field except analyzed string fields and booleans (not supported until 2.0, see elastic/elasticsearch issue #7851).

In detail, here is how I proceeded:

Before reindexing all my data, I restarted my ES 1.7 cluster and ran a query with sorting, aggregations and script fields to "warm up" the fielddata cache. Then I queried the _cat/fielddata endpoint to get an idea of the fielddata cache usage. It looked something like this:

    curl -XGET 'localhost:9200/_cat/fielddata?v&fields=*'

    id      host   ip            node  total  items.desc.raw more_fields...
    rKX7... myhost 192.168.1.100 Doom  32.9mb 2.3mb          ...

As you can see, the field items.desc.raw used 2.3mb of heap space. items is of type nested and contains a string multi-field with a not_analyzed sub-field called raw. In short, the mapping of that nested field looks like this:

    "items": {
      "type": "nested",
      "properties": {
        "desc": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
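For reference, here is a minimal sketch of the same mapping with doc_values enabled on the raw sub-field, which is the change tested next. It only builds and prints the JSON fragment; the helper name `nested_items_mapping` is made up for illustration and is not part of any Elasticsearch client.

```python
import json


def nested_items_mapping(use_doc_values: bool) -> dict:
    """Build the nested "items" mapping from the question.

    Hypothetical helper: it only assembles a dict, it does not talk to a cluster.
    """
    raw = {"type": "string", "index": "not_analyzed"}
    if use_doc_values:
        # The only difference between the old and the new mapping.
        raw["doc_values"] = True
    return {
        "items": {
            "type": "nested",
            "properties": {
                "desc": {
                    "type": "string",
                    "fields": {"raw": raw},
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(nested_items_mapping(use_doc_values=True), indent=2))
```

Changing doc_values on an existing field requires reindexing, which is why the whole index was rebuilt before measuring again.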

After adding doc_values: true to items.desc.raw, reindexing the whole index, and running the same kind of aggregations, sorting and scripting to warm up the fielddata cache, I queried the _cat/fielddata endpoint again. Here is the result:

    curl -XGET 'localhost:9200/_cat/fielddata?v&fields=*'

    id      host   ip            node  total  items.desc.raw some_bools...
    tAB5... myhost 192.168.1.100 Yack  2.1mb  9.2kb          ...

So fielddata usage has indeed dropped drastically, which is good. The only fields I expected to see were the boolean fields (i.e. some_bools above). To my surprise, my nested not_analyzed string field also appeared, although with much lower space usage.
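To put "much lower" into numbers, here is a small sketch that converts the two sizes reported for items.desc.raw into bytes and compares them. It assumes the binary units Elasticsearch uses when printing sizes (1kb = 1024 bytes); the helper `to_bytes` is hypothetical and only handles the suffixes listed.

```python
# Binary multipliers, assuming Elasticsearch-style size suffixes.
UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def to_bytes(size: str) -> float:
    """Convert a size string such as '2.3mb' or '9.2kb' to bytes."""
    text = size.strip().lower()
    # Check two-letter suffixes first, because 'kb' also ends with 'b'.
    for suffix in ("kb", "mb", "gb", "b"):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * UNITS[suffix]
    raise ValueError(f"unrecognized size: {size!r}")


before = to_bytes("2.3mb")  # items.desc.raw without doc_values
after = to_bytes("9.2kb")   # items.desc.raw with doc_values
ratio = before / after

if __name__ == "__main__":
    print(f"items.desc.raw shrank by a factor of about {ratio:.0f}")
```

With the figures from the two runs, the remaining entry is roughly 256 times smaller than the original one.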

What could be the cause of items.desc.raw still appearing in the fielddata cache?

Solution

Somehow I forgot about global ordinals. They are the reason the field still shows up in the fielddata cache even with doc_values enabled: global ordinals cannot be stored in doc_values, so they are still built in memory and counted as fielddata usage.

See more details here
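As a rough illustration of why global ordinals live outside doc_values, here is a toy model in Python. It is not Elasticsearch code, and every name in it is made up. The idea it sketches: each segment numbers its own sorted terms (per-segment ordinals, which doc values can hold), while an aggregation across segments needs one shared numbering, so a per-segment lookup table to global ordinals is built in memory on top.

```python
from typing import List, Tuple


def build_global_ordinals(
    segments: List[List[str]],
) -> Tuple[List[str], List[List[int]]]:
    """Toy model of global ordinals.

    `segments` holds, per segment, the sorted unique terms of a field.
    A term's index in that list is its per-segment ordinal.
    Returns the merged term list and, per segment, a table that maps
    each segment ordinal to its global ordinal.
    """
    global_terms = sorted(set().union(*segments))
    position = {term: i for i, term in enumerate(global_terms)}
    mappings = [[position[term] for term in terms] for terms in segments]
    return global_terms, mappings


# Two hypothetical segments of the same field.
segments = [["apple", "cherry"], ["banana", "cherry", "date"]]
global_terms, mappings = build_global_ordinals(segments)

if __name__ == "__main__":
    for seg_id, table in enumerate(mappings):
        for seg_ord, global_ord in enumerate(table):
            term = segments[seg_id][seg_ord]
            print(f"segment {seg_id}: ordinal {seg_ord} -> global {global_ord} ({term})")
```

The lookup tables are small compared to loading every field value onto the heap, which fits the drop from 2.3mb to 9.2kb seen above, but they depend on the current set of segments and so cannot be written into any single segment's doc values.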
