Get the number of appearances of a particular term in an Elasticsearch field

Problem description

I have an Elasticsearch index (posts) with the following mappings:

{
    "id": "integer",
    "title": "text",
    "description": "text"
}

I simply want to find the number of occurrences of a particular term inside the description field of a single, specific document (I have the document ID and the term to find).

For example, I have a post like this: {id: 123, title: "some title", description: "my city is LA, this post description has two occurrences of word city "}.

I have the document ID (post ID) for this post and just want to find how many times the word "city" appears in the description of this particular post. (The result should be 2 in this case.)

I can't seem to find a way to do this search. I don't want the occurrences across ALL the documents, just those for a single document and one field inside it. Please suggest a query for this. Thanks.

Elasticsearch version: 7.5

Recommended answer

You can use a terms aggregation on your description, but you need to make sure fielddata is set to true on the text field you aggregate on (below, the description.simple_analyzer sub-field).

PUT kamboh/
{
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text",
        "fields": {
          "simple_analyzer": {
            "type": "text",
            "fielddata": true,
            "analyzer": "simple"
          },
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Ingesting a sample document:

PUT kamboh/_doc/1
{
  "id": 123,
  "title": "some title",
  "description": "my city is LA, this post description has two occurrences of word city "
}

Aggregating:

GET kamboh/_search
{
  "size": 0,
  "aggregations": {
    "terms_agg": {
      "terms": {
        "field": "description.simple_analyzer",
        "size": 20
      }
    }
  }
}

This yields:

"aggregations" : {
    "terms_agg" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "city",
          "doc_count" : 1
        },
        {
          "key" : "description",
          "doc_count" : 1
        },
        ...
      ]
    }
  }

Now, as you can see, the simple analyzer split the string into words and made them lowercase, but the duplicate city in your string is not reflected in the result: the doc_count of a terms bucket is the number of documents containing the term, not the number of times the term occurs inside a document. I could not come up with an analyzer that would make the aggregation keep the duplicates. With that being said,

You would split your string by whitespace and index the pieces as an array of words instead of one long string.
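
The splitting step could be done on the client before the document is sent to Elasticsearch. The following is a minimal Python sketch of that idea, not a definitive implementation. The helper build_document and the extra field name description_words are hypothetical and do not appear in the mapping above; how such an array field should be mapped and queried is left open here.

```python
def build_document(post_id, title, description):
    """Return an index-ready body with the description pre-split into words."""
    return {
        "id": post_id,
        "title": title,
        "description": description,
        # Hypothetical extra field (an assumption, not part of the original mapping):
        # one array element per whitespace-separated word, duplicates kept.
        "description_words": description.split(),
    }


doc = build_document(
    123,
    "some title",
    "my city is LA, this post description has two occurrences of word city ",
)

# Counting on the client side is then a plain list operation.
city_count = doc["description_words"].count("city")
```

Note that a plain whitespace split keeps punctuation attached to words ("LA," stays as one token), just like the search-time script further down.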

This is also possible at search time, although it is very expensive, does not scale well, and requires script.painless.regex.enabled: true in your elasticsearch.yml:

GET kamboh/_search
{
  "size": 0,
  "aggregations": {
    "terms_script": {
      "scripted_metric": {
        "params": {
          "word_of_interest": ""
        },
        "init_script": "state.map = [:];",
        "map_script": """
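              // skip documents that have no value in the keyword sub-field
              if (!doc.containsKey('description.keyword') || doc['description.keyword'].size() == 0) return;

              // split the raw string on single spaces (needs regex support enabled)
              def split_by_whitespace = / /.split(doc['description.keyword'].value);

              for (def word : split_by_whitespace) {
                 // when a word of interest is given, skip every other word
                 if (params['word_of_interest'] != "" && params['word_of_interest'] != word) {
                   continue;
                 }

                 if (state.map.containsKey(word)) {
                   state.map[word] += 1;
                   continue;
                 }

                 state.map[word] = 1;
              }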
""",
        "combine_script": "return state.map;",
        "reduce_script": "return states;"
      }
    }
  }
}

This yields:

...
"aggregations" : {
    "terms_script" : {
      "value" : [
        {
          "occurrences" : 1,
          "post" : 1,
          "city" : 2,  <------
          "LA," : 1,
          "of" : 1,
          "this" : 1,
          "description" : 1,
          "is" : 1,
          "has" : 1,
          "my" : 1,
          "two" : 1,
          "word" : 1
        }
      ]
    }
  }
...
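
Because reduce_script returns states unchanged, the value array holds one word-count map per shard that took part in the search. Reading the count for a single term out of the parsed response could look like the Python sketch below. The helper name count_term and the sample response dictionary are illustrative assumptions; only the aggregations / terms_script / value path is taken from the output above.

```python
def count_term(response, term):
    """Sum the occurrences of `term` over all per-shard maps in the response."""
    per_shard_maps = response["aggregations"]["terms_script"]["value"]
    # A shard's map may lack the term entirely, so default to 0.
    return sum(m.get(term, 0) for m in per_shard_maps if m)


# Hand-written sample shaped like the output above (abbreviated).
sample_response = {
    "aggregations": {
        "terms_script": {
            "value": [
                {"occurrences": 1, "post": 1, "city": 2, "LA,": 1, "word": 1}
            ]
        }
    }
}

city_total = count_term(sample_response, "city")
missing_total = count_term(sample_response, "village")
```

If the search is not restricted to one document, the maps contain counts accumulated over every matching document on each shard, so the sum is only a per-document count when the query matches a single post.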
