Count of "actual hits" (not just matching docs) for arbitrary queries in Elasticsearch

Problem description

This one really frustrates me. I tried to find a solution for quite a long time, but wherever I try to find questions from people asking for the same, they either want something a little different (like here or here or here) or don't get an answer that solves the problem (like here).

What I need

I want to know how many hits my search has in total, independently from the type of query used. I am not talking about the number of hits you always get from ES, which is the number of documents found for that query, but rather the number of occurrences of document features matching my query.
For example, I could have two documents with a text field "description", both containing the word hero, but one of them containing it twice.
Like in this minimal example here:

Index mapping:

PUT /sample
{
    "settings": {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 0
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "name": { "type": "keyword" },
                "description": { "type": "text" }
            }
        }
    }
}

Two sample documents:

POST /sample/doc
{
    "name": "Jack Beauregard",
    "description": "An aging hero"
}


POST /sample/doc
{
    "name": "Master Splinter",
    "description": "This rat is a hero, a real hero!"
}

...and the query:

POST /sample/_search
{
    "query": {
        "match": { "description": "hero" }
    },
    "_source": false
}

...which gives me:

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.22396864,
        "hits": [
            {
                "_index": "sample",
                "_type": "doc",
                "_id": "hoDsm2oB22SyyA49oDe_",
                "_score": 0.22396864
            },
            {
                "_index": "sample",
                "_type": "doc",
                "_id": "h4Dsm2oB22SyyA49xDf8",
                "_score": 0.22227617
            }
        ]
    }
}

So there are two hits ("total": 2), which is correct, because the query matches two documents. BUT I want to know how many times my query matched inside each document (or the sum of this), which would be 3 in this example, because the second document contains the search term twice.
IMPORTANT: This is just a simple example. But I want this to work for any type of query and any mapping, also nested documents with inner_hits and all.
I didn't expect this to be so difficult, because it must be information ES comes across during search anyway, right? I mean, it ranks the documents with more hits inside them higher, so why can't I get the count of these hits?
I am tempted to call them "inner hits", but that is the name of a different ES feature (see below).

What I tried / could try (but it's ugly)

  • I could use highlighting (which I do anyway) and try to make the highlighter generate one highlight for each "inner match" (without combining them), then post-process the complete set of search results and count all the highlights --> Of course, this is very ugly, because (1) I don't really want to post-process my results and (2) I'd have to fetch all results to do this by setting size to a high enough value, when I actually only want the number of results requested by the client. This would be a lot of overhead!
  • The feature inner_hits sounds very promising, but it just means that you can handle the hits inside nested documents independently to get a highlighting for each of them. I use this for my nested docs already, but it doesn't solve this problem because (1) it persists on inner hit level and (2) I want this to work with non-nested queries, too.
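
The highlight-counting workaround from the first bullet could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the search request is assumed to enable highlighting on description with the default <em> tags and number_of_fragments set to 0 (so the whole field is returned), and count_highlights is a hypothetical helper name, not part of any Elasticsearch client. The response below is hand-written in the shape Elasticsearch returns, not real output.

```python
def count_highlights(response, pre_tag="<em>"):
    """Count highlighted terms per hit in a search response dict.

    Assumption: each highlighted occurrence is wrapped in its own pre_tag,
    which depends on the highlighter type and settings.
    """
    counts = {}
    for hit in response.get("hits", {}).get("hits", []):
        fragments_by_field = hit.get("highlight", {})
        counts[hit["_id"]] = sum(
            fragment.count(pre_tag)
            for fragments in fragments_by_field.values()
            for fragment in fragments
        )
    return counts


# Illustrative, hand-written response (IDs and fragments are made up).
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "highlight": {"description": ["An aging <em>hero</em>"]}},
            {
                "_id": "2",
                "highlight": {
                    "description": ["This rat is a <em>hero</em>, a real <em>hero</em>!"]
                },
            },
        ]
    }
}

per_doc = count_highlights(sample_response)
total = sum(per_doc.values())
print(per_doc, total)
```

Note that this only counts the hits contained in the returned page, so a total across all matching documents would still require fetching every result, which is exactly the overhead described above.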

Is there a way to achieve this in a generic way for arbitrary queries? I'd be most thankful for any suggestions. I'm even down for solving it by tinkering with the ranking or using script fields, anything.

Thanks a lot!

Recommended answer

I would definitely not recommend this for any kind of practical use due to the awful performance, but this data is technically available in the term frequency calculation in the results from the explain API. See What is Relevance? for a conceptual explanation and Explain API for usage.
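
To make the idea more concrete, here is a rough sketch of pulling term frequencies out of an explanation tree. It assumes the search was sent with "explain": true, so that each hit carries an _explanation object with value, description and details. The description prefixes matched below (termFreq=, phraseFreq=, freq,) are assumptions about what some Elasticsearch versions emit; they are not a stable API and should be checked against real output from your own cluster. sum_term_freq is a hypothetical helper, and the sample tree is hand-written and abbreviated, not real output.

```python
# Assumed description prefixes of frequency nodes; these vary by version.
FREQ_PREFIXES = ("termFreq=", "phraseFreq=", "freq,")


def sum_term_freq(explanation):
    """Recursively sum the values of frequency nodes in an explanation tree."""
    description = explanation.get("description", "")
    if description.startswith(FREQ_PREFIXES):
        return float(explanation.get("value", 0.0))
    children = explanation.get("details") or []
    return sum(sum_term_freq(child) for child in children)


# Hand-written, abbreviated tree in the general shape of an _explanation.
sample_explanation = {
    "value": 0.22,
    "description": "weight(description:hero in 0), result of:",
    "details": [
        {
            "value": 0.22,
            "description": "score(doc=0), product of:",
            "details": [
                {"value": 0.18, "description": "idf, computed as ...", "details": []},
                {
                    "value": 1.2,
                    "description": "tfNorm, computed as ...",
                    "details": [
                        {"value": 2.0, "description": "termFreq=2.0", "details": []},
                        {"value": 1.2, "description": "parameter k1", "details": []},
                    ],
                },
            ],
        }
    ],
}

print(sum_term_freq(sample_explanation))
```

Even if the prefixes match, this only covers queries whose scoring is built on term frequencies, and it needs the explanation for every matching document, which is where the poor performance comes from.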
