Getting total term frequency throughout entire index (Elasticsearch)
Question
I am trying to calculate the total number of times a particular term occurs throughout an entire index (term collection frequency). I have attempted to do so through the use of term vectors, however this is restricted to a single document. Even in the case of terms that exist within a specified document, the response seems to max out at a certain doc_count (within field_statistics) which makes me doubtful of its accuracy.
Request:
http://myip:9200/clinicaltrials/trial/AVmk-ky6XMskTDwIwpih/_termvectors?term_statistics=true
The document id being used here is "AVmk-ky6XMskTDwIwpih", although the term statistics should not be specific to a document.
Response:
This is what I get for the term "cancer" for one of the fields:
"cancer" : {
  "doc_freq" : 5297,
  "ttf" : 10587,
  "term_freq" : 1,
  "tokens" : [
    {
      "position" : 15,
      "start_offset" : 115,
      "end_offset" : 121
    }
  ]
},
If I total the ttf for all fields, I get 18915. However, the actual total term frequency for "cancer" is in fact 542829. This leads me to believe that it is limiting the term_vector stats to a subset of documents within the index.
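The per-field totalling described above can be sketched as follows. This is a minimal illustration, assuming the response has the usual `term_vectors` → field → `terms` → term layout that `_termvectors` returns when `term_statistics=true` is set. The field names and the second field's numbers in `sample` are made up for the example. The sum is only as complete as the statistics the API reports, so it does not by itself give an index-wide count.

```python
from typing import Any, Dict


def total_ttf(response: Dict[str, Any], term: str) -> int:
    """Sum the `ttf` reported for `term` over every field in a _termvectors response."""
    total = 0
    for field_data in response.get("term_vectors", {}).values():
        stats = field_data.get("terms", {}).get(term)
        if stats is not None:
            # `ttf` is only present when the request set term_statistics=true
            total += stats.get("ttf", 0)
    return total


# Hypothetical response fragment: the field names and the second field's
# values are invented; only the first "cancer" entry mirrors the question.
sample = {
    "term_vectors": {
        "brief_title": {
            "terms": {"cancer": {"doc_freq": 5297, "ttf": 10587, "term_freq": 1}}
        },
        "official_title": {
            "terms": {"cancer": {"doc_freq": 80, "ttf": 100, "term_freq": 2}}
        },
    }
}

print(total_ttf(sample, "cancer"))
```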
Any advice here would be greatly appreciated.
Recommended answer
The counts differ because term vector statistics are only accurate when the index has a single shard. In an index with multiple shards, the documents are spread across the shards, and the statistics are gathered only from the shard that holds the requested document (for artificial documents, a randomly selected shard). The frequency returned is therefore that shard's count, not the index-wide total.
Thus, the returned frequency is only a relative measure, not the absolute value you expect; see the Behaviour section of the term vectors API documentation. To test this, you can create a single-shard index and request the frequency there (it should give you the actual total).
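A rough sketch of that single-shard test, building the two HTTP requests with Python's standard library. The base URL and the index name `clinicaltrials_single` are assumptions for illustration. The term vectors URL follows the typed form used in the question; newer Elasticsearch versions drop the type segment. Nothing is sent until a request is passed to `urllib.request.urlopen`.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: replace with your cluster address


def create_single_shard_index(index: str) -> Request:
    """Build a PUT request that creates `index` with exactly one primary shard."""
    body = json.dumps({"settings": {"index": {"number_of_shards": 1}}}).encode("utf-8")
    return Request(
        f"{BASE_URL}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def termvectors_request(index: str, doc_type: str, doc_id: str) -> Request:
    """Build the term vectors request from the question, with term statistics on."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}/_termvectors?term_statistics=true"
    return Request(url, method="GET")


# Hypothetical index name; the type and document ID are taken from the question.
create_req = create_single_shard_index("clinicaltrials_single")
tv_req = termvectors_request("clinicaltrials_single", "trial", "AVmk-ky6XMskTDwIwpih")
print(create_req.get_method(), create_req.full_url)
print(tv_req.get_method(), tv_req.full_url)
```

The documents would have to be indexed into the new index before the second request returns anything useful, and the document ID may differ there if IDs are auto-generated.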