ElasticSearch - Statistical facet on length of string field

Question

I would like to retrieve data about a string field like the min, max and average length (by counting the number of characters inside the string). My issue is that aggregations can only be used for numeric fields. Besides, I tried it using a simple statistical facet,

    {
        "query": {
            "match_all": {}
        },
        "facets": {
            "stat1": {
                "statistical": {
                    "field": "title"
                }
            }
        }
    }

but I get shard failures and SearchPhaseExecutionException. When trying with a script field the error returned is an OutOfMemoryError:

    {
        "query": {
            "match_all": {}
        },
        "script_fields": {
            "test1": {
                "script": "doc[\"title\"].value"
            }
        }
    }

Is it possible to retrieve such data about a simple "title" string field using curl? Thank you!

Recommended answer

I haven't actually tried the following, but I believe it should work.

First some useful doc-references:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-statistical-facet.html


In order to implement the statistical facet, the relevant field values are loaded into memory from the index. This means that per shard, there should be enough memory to contain them. Since by default, dynamic introduced types are long and double, one option to reduce the memory footprint is to explicitly set the types for the relevant fields to either short, integer, or float when possible.

I'm not sure exactly how to set the type of the script field to 'short', which is probably what you want in order to reduce memory use. It should be possible, though.

Also: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-script-fields.html


It’s important to understand the difference between doc['my_field'].value and _source.my_field. The first, using the doc keyword, will cause the terms for that field to be loaded to memory (cached), which will result in faster execution, but more memory consumption. Also, the doc[...] notation only allows for simple valued fields (can’t return a json object from it) and make sense only on non-analyzed or single term based fields.

So an alternative would be to use _source instead of doc, which would not load the field's terms into the cache.

Giving:

    {
        "query" : {
            "match_all" : {}
        },
        "facets" : {
            "stat1" : {
                "statistical" : {
                    "script" : "doc['title'].value.length()"
                }
            }
        }
    }

For the alternative that isn't cached, replace the script with "_source.title.length()".
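Since the question asks about curl, here is a minimal Python sketch that builds the request body for either variant (doc or _source) and prints a matching curl command. Like the query itself, it is untested against a live cluster. The host `localhost:9200`, the index name `myindex`, and the helper names `build_length_stats_body` and `build_curl_command` are assumptions made for illustration, and the request presumes an Elasticsearch version that still supports facets.

```python
import json
import shlex


def build_length_stats_body(field="title", use_source=False):
    """Build a statistical-facet request over the string length of `field`.

    Hypothetical helper: use_source=True switches to the uncached
    _source-based script instead of the doc-based one.
    """
    if use_source:
        script = "_source.%s.length()" % field
    else:
        script = "doc['%s'].value.length()" % field
    return {
        "query": {"match_all": {}},
        "facets": {"stat1": {"statistical": {"script": script}}},
    }


def build_curl_command(url, body):
    """Return a shell-safe curl command that POSTs `body` as JSON to `url`."""
    payload = json.dumps(body)
    return "curl -XPOST %s -d %s" % (shlex.quote(url), shlex.quote(payload))


# Assumed endpoint: a local node and an index called "myindex".
body = build_length_stats_body("title")
command = build_curl_command("http://localhost:9200/myindex/_search", body)
print(command)
```

Building the body as a dictionary and serializing it with `json.dumps` sidesteps the quoting mistakes that are easy to make when writing the JSON by hand, and `shlex.quote` keeps the single quotes inside the script intact when the command is pasted into a shell.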
