Elasticsearch: Count terms in document


Question

I'm fairly new to Elasticsearch and am using version 6.5. My database contains website pages and their content, like this:

Url      Content
abc.com  There is some content about cars here. Lots of cars!
def.com  This page is all about cars.
ghi.com  Here it tells us something about insurances.
jkl.com  Another page about cars and how to buy cars.

I have been able to perform a simple query (using Python) that returns all documents that contain the word "cars" in their content:

current_app.elasticsearch.search(index=index, doc_type=index, 
    body={"query": {"multi_match": {"query": "cars", "fields": ["*"]}}, 
    "from": 0, "size": 100})

The result looks like this:

{'took': 2521, 
'timed_out': False, 
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 
'hits': {'total': 29, 'max_score': 3.0240571, 'hits': [{'_index': 
'pages', '_type': 'pages', '_id': '17277', '_score': 3.0240571, 
'_source': {'content': '....'}}]}}

Each "_id" refers to a domain, so I basically get back:


  • abc.com

  • def.com

  • jkl.com

But I now want to know how often the search term ("cars") occurs in each document, like:


  • abc.com:2

  • def.com:1

  • jkl.com:2

I found several solutions for obtaining the number of documents that contain the search term, but none that show how to get the number of occurrences of a term within a document. I also couldn't find anything in the official documentation, although I'm pretty sure it is in there somewhere and I'm just not realising that it is the solution to my problem.

Update:

As suggested by @Curious_MInd, I tried a terms aggregation:

current_app.elasticsearch.search(index=index, doc_type=index, 
    body={"aggs" : {"cars_count" : {"terms" : { "field" : "Content" 
}}}})

Result:

{'took': 729, 'timed_out': False, '_shards': {'total': 5, 'successful': 
5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 48, 'max_score': 1.0, 
'hits': [{'_index': 'pages', '_type': 'pages', '_id': '17252', 
'_score': 1.0, '_source': {'content': '...'}}]}, 'aggregations': 
{'skala_count': {'doc_count_error_upper_bound': 0, 
'sum_other_doc_count': 0, 'buckets': []}}}

I don't see where it would display the counts per document here, but I'm assuming that's because "buckets" is empty? On another note: the results found by the terms aggregation are significantly worse than those of the multi_match query. Is there any way to combine the two?

Recommended answer

What you are trying to achieve can't be done in a single query. The first query filters the documents and returns the IDs of those for which term counts are required. Let's assume you have the following mapping:

{
  "test": {
    "mappings": {
      "_doc": {
        "properties": {
          "details": {
            "type": "text",
            "store": true,
            "term_vector": "with_positions_offsets_payloads"
          },
          "name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Assuming your query returns the following two docs:

{
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "details": "There is some content about cars here. Lots of cars!",
          "name": "n1"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "details": "This page is all about cars",
          "name": "n2"
        }
      }
    ]
  }
}

From the above response you can get all the document IDs that matched your query. For the above we have "_id": "1" and "_id": "2".
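As a minimal sketch in Python (the language used in the question), the IDs could be collected from the search response like this. The function name extract_doc_ids is hypothetical, and the response is assumed to be a dict shaped like the example above.

```python
def extract_doc_ids(response):
    """Return the _id of every hit in a search response dict."""
    hits = response.get("hits", {}).get("hits", [])
    return [hit["_id"] for hit in hits]


# Trimmed-down copy of the example response shown above (assumed shape).
sample_response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "test", "_type": "_doc", "_id": "1"},
            {"_index": "test", "_type": "_doc", "_id": "2"},
        ],
    }
}

doc_ids = extract_doc_ids(sample_response)
print(doc_ids)
```

Using .get with defaults means a response without any hits yields an empty list rather than raising a KeyError.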

Now we use the _mtermvectors API to get the frequency (count) of each term in a given field:

POST test/_doc/_mtermvectors
{
  "docs": [
    {
      "_id": "1",
      "fields": [
        "details"
      ]
    },
    {
      "_id": "2",
      "fields": [
        "details"
      ]
    }
  ]
}
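The request body above can also be built programmatically from the list of matched IDs. The following is only a sketch: build_mtermvectors_body is a hypothetical helper, and the closing comment about how the body would be sent assumes the 6.x elasticsearch-py client exposes an mtermvectors method taking index, doc_type and body.

```python
def build_mtermvectors_body(doc_ids, field):
    """Build a multi term vectors request body for one field per document."""
    return {
        "docs": [{"_id": doc_id, "fields": [field]} for doc_id in doc_ids]
    }


body = build_mtermvectors_body(["1", "2"], "details")
print(body)

# Assumption: with an elasticsearch-py 6.x client named `es`, the body
# would be sent roughly like this (not executed here, needs a cluster):
# response = es.mtermvectors(index="test", doc_type="_doc", body=body)
```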

The above returns the following result:

{
  "docs": [
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "1",
      "_version": 1,
      "found": true,
      "took": 8,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....
            ,
            "cars": {
              "term_freq": 2,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 28,
                  "end_offset": 32
                },
                {
                  "position": 9,
                  "start_offset": 47,
                  "end_offset": 51
                }
              ]
            },
            ....
          }
        }
      }
    },
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 2,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....
            ,
            "cars": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 23,
                  "end_offset": 27
                }
              ]
            },
            ....
          }
        }
      }
    }
  ]
}

Note that I have used .... to denote the data for other terms in the field, since the term vectors API returns details for every term. You can extract the information about the required term from the above response; here I have shown it for cars, and the field you are interested in is term_freq.
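Pulling term_freq out of the response could look roughly like the sketch below. The helper name term_counts is hypothetical, and the sample dict is a trimmed version of the response above. The term should be given in its analyzed form (for example lowercase when the standard analyzer is used), because term vectors hold the indexed tokens.

```python
def term_counts(mtv_response, field, term):
    """Map each document _id to the term_freq of `term` in `field`.

    Documents in which the term does not occur are reported with 0.
    """
    counts = {}
    for doc in mtv_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        counts[doc["_id"]] = terms.get(term, {}).get("term_freq", 0)
    return counts


# Trimmed-down copy of the _mtermvectors response shown above.
sample = {
    "docs": [
        {"_id": "1", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 2}}}}},
        {"_id": "2", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 1}}}}},
    ]
}

print(term_counts(sample, "details", "cars"))  # {'1': 2, '2': 1}
```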
