Getting total term frequency throughout an entire index (Elasticsearch)


Problem description

I am trying to calculate the total number of times a particular term occurs across an entire index (its collection frequency). I tried term vectors, but that API is scoped to a single document. Even for terms that do exist in the specified document, the counts in the response seem to max out at a certain doc_count (within field_statistics), which makes me doubt their accuracy.

Request:

http://myip:9200/clinicaltrials/trial/AVmk-ky6XMskTDwIwpih/_termvectors?term_statistics=true

The document ID used here is "AVmk-ky6XMskTDwIwpih", although the term statistics should not be specific to a document.

Response:

This is what I get for the term "cancer" in one of the fields:

"cancer" : {
  "doc_freq" : 5297,
  "ttf" : 10587,
  "term_freq" : 1,
  "tokens" : [
    {
      "position" : 15,
      "start_offset" : 115,
      "end_offset" : 121
    }
  ]
},

If I total the ttf across all fields, I get 18915. However, the actual total term frequency for "cancer" is 542829. This leads me to believe that the term vector statistics are limited to a subset of the documents in the index.
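For reference, the per-field summation described above could be sketched roughly as follows. This is only an illustration: it assumes the response has the usual `term_vectors` -> field -> `terms` -> term layout, the helper name `total_ttf` and the field names `brief_title` and `keyword` are made up, and only the first entry reuses numbers from the response shown earlier.

```python
from typing import Any, Dict


def total_ttf(response: Dict[str, Any], term: str) -> int:
    """Sum the `ttf` of `term` over every field of a _termvectors response.

    Fields that do not contain the term, or that lack `ttf` because
    term_statistics was not requested, contribute 0.
    """
    total = 0
    for field_data in response.get("term_vectors", {}).values():
        term_stats = field_data.get("terms", {}).get(term)
        if term_stats is not None:
            total += term_stats.get("ttf", 0)
    return total


# Hypothetical two-field response. Field names are invented; the second
# field's numbers are invented too.
sample = {
    "term_vectors": {
        "brief_title": {
            "terms": {"cancer": {"doc_freq": 5297, "ttf": 10587, "term_freq": 1}}
        },
        "keyword": {
            "terms": {"cancer": {"doc_freq": 100, "ttf": 120, "term_freq": 2}}
        },
    }
}

print(total_ttf(sample, "cancer"))  # 10587 + 120 = 10707
```

Whatever the sum comes out to, it can only be as complete as the statistics the response itself carries.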

Any advice here would be greatly appreciated.

Recommended answer

The counts differ because term vector statistics are not accurate unless the index has a single shard. In an index with multiple shards the documents are distributed across the shards, so the frequency returned is not the index-wide total but the figure from one shard only. According to the Elasticsearch documentation, that is the shard holding the requested document; a shard is picked at random only for artificial documents.
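To make the effect concrete, here is a toy model in plain Python. It is not Elasticsearch code: the CRC32-based `shard_for` routing, the helper names, and the generated documents are all assumptions chosen for illustration. It only shows that a per-shard count is a slice of the total, and that the two coincide when there is exactly one shard.

```python
import zlib
from typing import Dict, List


def shard_for(doc_id: str, num_shards: int) -> int:
    """Toy routing: hash the document ID onto a shard number.

    Elasticsearch uses its own hash of the routing value; CRC32 is a stand-in.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def per_shard_ttf(docs: Dict[str, List[str]], term: str, num_shards: int) -> List[int]:
    """Count occurrences of `term` separately on each shard."""
    counts = [0] * num_shards
    for doc_id, tokens in docs.items():
        counts[shard_for(doc_id, num_shards)] += tokens.count(term)
    return counts


# 30 made-up documents containing "cancer" 0, 1 or 2 times each (30 in total).
docs = {f"doc{i}": ["cancer"] * (i % 3) + ["trial"] for i in range(30)}

shards = per_shard_ttf(docs, "cancer", 5)
print(shards)        # what a single shard could report, one number per shard
print(sum(shards))   # the index-wide total, 30
print(per_shard_ttf(docs, "cancer", 1))  # one shard: the shard count is the total
```

A request answered from a single shard sees only one entry of `shards`, which is why the reported ttf can be far below the true total.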

Thus, the returned frequency is only a relative measure, not the absolute value you expect; see the Behaviour section of the term vectors documentation. To test this, create a single-shard index and request the frequency there (it should give you the actual total).
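A rough sketch of that test, expressed as the two HTTP requests involved. The index name `clinicaltrials_single` and the helper names are hypothetical; the host placeholder, type and document ID are copied from the question. The code only builds the requests and does not send them, and the documents would still have to be re-indexed into the new index before the second request returns anything useful.

```python
import json
from typing import Optional, Tuple

BASE_URL = "http://myip:9200"  # host placeholder taken from the question

Request = Tuple[str, str, Optional[str]]  # (HTTP method, URL, JSON body or None)


def create_single_shard_index_request(index: str) -> Request:
    """Build the request that creates an index with exactly one primary shard."""
    body = {"settings": {"index": {"number_of_shards": 1}}}
    return "PUT", f"{BASE_URL}/{index}", json.dumps(body)


def termvectors_request(index: str, doc_type: str, doc_id: str) -> Request:
    """Build the term vectors request in the same URL form as the question."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}/_termvectors?term_statistics=true"
    return "GET", url, None


# "clinicaltrials_single" is a made-up name for the single-shard copy.
print(create_single_shard_index_request("clinicaltrials_single"))
print(termvectors_request("clinicaltrials_single", "trial", "AVmk-ky6XMskTDwIwpih"))
```

The URL layout follows the typed form used in the question; newer Elasticsearch versions without mapping types use `/{index}/_termvectors/{id}` instead.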
