How to correctly query inside of terms aggregate values in Elasticsearch, using include and regex?


Problem description

How do you efficiently filter/search aggregation results?

Imagine you have 1 million documents in Elasticsearch. In those documents, you have a multi-field (keyword, text) called tags:

{
  ...
  tags: ['Race', 'Racing', 'Mountain Bike', 'Horizontal'],
  ...
},
{
  ...
  tags: ['Tracey Chapman', 'Silverfish', 'Blue'],
  ...
},
{
  ...
  tags: ['Surfing', 'Race', 'Disgrace'],
  ...
},

You can use these values as filters (facets) against a query to pull only the documents that contain a given tag:

...
"filter": [
  {
    "terms": {
      "tags": [
        "Race"
      ]
    }
  },
  ...
]

But you want the user to be able to query for possible tag filters. So if the user types race, the return should show (from the previous example) ['Race', 'Tracey Chapman', 'Disgrace']. That way, the user can search for a filter to use. To accomplish this, I had to use aggregations:

{
  "aggs": {
    "topics": {
      "terms": {
        "field": "tags",
        "include": ".*[Rr][Aa][Cc][Ee].*", // I have to dynamically form this
        "size": 6
      }
    }
  },
  "size": 0
}
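The include pattern in the request above has to be formed from whatever the user types. A minimal sketch of that step in Python, assuming plain-text input where every non-letter character should match literally. The helper name build_include_regex and the RESERVED list are illustrative assumptions, not something from the original question:

```python
# Characters assumed to act as operators in Lucene-style regular expressions.
# Verify this list against the regexp syntax docs for your Elasticsearch version.
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def build_include_regex(user_input: str) -> str:
    """Build a case-insensitive "contains" pattern for the terms `include` option."""
    parts = []
    for ch in user_input:
        if ch.isalpha() and ch.lower() != ch.upper():
            # Match either case of a letter, e.g. 'r' -> '[Rr]'
            parts.append(f"[{ch.upper()}{ch.lower()}]")
        elif ch in RESERVED:
            # Escape operators so they are matched literally
            parts.append("\\" + ch)
        else:
            parts.append(ch)
    return ".*" + "".join(parts) + ".*"


print(build_include_regex("race"))  # .*[Rr][Aa][Cc][Ee].*
```

This only automates building the pattern; the leading and trailing wildcards that make the aggregation slow are still there.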

This gives me exactly what I need! But it is slow, very slow. I've tried adding the execution_hint, but it does not help.

You may think, "Just use a query before the aggregate!" But the issue is that it'll pull all values for all documents matched by that query, meaning you can end up displaying tags that are completely unrelated. If I queried for race before the aggregate and did not use the include regex, I would end up with all those other values, like 'Horizontal', etc.

How can I rewrite this aggregation to work faster? Is there a better way to write this? Do I really have to make a separate index just for values? (sad face) This seems like a common issue, but I have found no answers through documentation and googling.

Recommended answer

You certainly don't need a separate index just for the values...

Here's my take:

  1. What you're doing with the regex is essentially what a tokenizer should have done, i.e. constructing substrings (or n-grams) so that they can be targeted later.
    This means that the keyword Race will need to be tokenized into the n-grams ["rac", "race", "ace"]. (It doesn't really make sense to go lower than 3 characters. Most autocomplete libraries ignore inputs shorter than 3 characters because the number of possible matches balloons too quickly.)

Elasticsearch offers the N-gram tokenizer, but we'll need to increase the index-level setting max_ngram_diff from its default of 1 to (arbitrarily) 10 because we want to catch as many n-grams as is reasonable:

PUT tagindex
{
  "settings": {
    "index": {
      "max_ngram_diff": 10
    },
    "analysis": {
      "analyzer": {
        "my_ngrams_analyzer": {
          "tokenizer": "my_ngrams",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": { ... }                                 --> see below
}
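To see which tokens this analyzer would produce, here is a rough Python approximation of my_ngrams_analyzer. It is only a sketch: it assumes token_chars [letter, digit] behaves like "split on anything that is not a letter or digit", and the function name ngram_tokens is made up for illustration. To inspect the real output, use the _analyze API on the index.

```python
import re


def ngram_tokens(text: str, min_gram: int = 3, max_gram: int = 10) -> list[str]:
    """Approximate the analyzer: split on non letter/digit characters,
    emit every n-gram of each chunk, then lowercase."""
    tokens = []
    # Anything that is not a letter or digit acts as a separator
    for chunk in re.findall(r"[^\W_]+", text):
        for start in range(len(chunk)):
            for size in range(min_gram, max_gram + 1):
                if start + size > len(chunk):
                    break
                tokens.append(chunk[start:start + size].lower())
    return tokens


print(ngram_tokens("Race"))  # ['rac', 'race', 'ace']

for tag in ["Race", "Tracey Chapman", "Disgrace", "Racing"]:
    # Only tags whose tokens include "race" would match a term filter on "race"
    print(tag, "race" in ngram_tokens(tag))
```

Under these assumptions "Race", "Tracey Chapman" and "Disgrace" all produce the token race, while "Racing" does not, which is exactly the set the question asks for.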

  2. When your tags field is a list of keywords, it's simply not possible to aggregate on that field without resorting to the include option, which can be either exact matches or a regex (which you're already using). Now, we cannot guarantee exact matches, but we also don't want to use a regex. That's why we need a nested list, which treats each tag separately.

Now, nested lists are expected to contain objects, so

{
  "tags": ["Race", "Racing", "Mountain Bike", "Horizontal"]
}

will need to be converted to

{
  "tags": [
    { "tag": "Race" },
    { "tag": "Racing" },
    { "tag": "Mountain Bike" },
    { "tag": "Horizontal" }
  ]
}
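If the documents are reshaped in application code before indexing, the conversion could look like the following Python sketch. The function name to_nested_tags is hypothetical, and it assumes each source document is a dict whose tags value is a flat list of strings:

```python
def to_nested_tags(doc: dict) -> dict:
    """Return a copy of doc with tags as a list of {"tag": ...} objects."""
    converted = dict(doc)  # shallow copy so the input is left untouched
    converted["tags"] = [{"tag": t} for t in doc.get("tags", [])]
    return converted


original = {"tags": ["Race", "Racing", "Mountain Bike", "Horizontal"]}
print(to_nested_tags(original))
```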

After that we'll proceed with the multi-field mapping, keeping the original tags intact but also adding a .tokenized field to search on and a .keyword field to aggregate on:

  "index": { ... },
  "analysis": { ... },
  "mappings": {
    "properties": {
      "tags": {
        "type": "nested",
        "properties": {
          "tag": {
            "type": "text",
            "fields": {
              "tokenized": {
                "type": "text",
                "analyzer": "my_ngrams_analyzer"
              },
              "keyword": {
                "type": "keyword"
              }
            }
          }
        }
      }
    }
  }

We'll then add our adjusted tags docs:

POST tagindex/_doc
{"tags":[{"tag":"Race"},{"tag":"Racing"},{"tag":"Mountain Bike"},{"tag":"Horizontal"}]}

POST tagindex/_doc
{"tags":[{"tag":"Tracey Chapman"},{"tag":"Silverfish"},{"tag":"Blue"}]}

POST tagindex/_doc
{"tags":[{"tag":"Surfing"},{"tag":"Race"},{"tag":"Disgrace"}]}

and apply a nested filter terms aggregation:

GET tagindex/_search
{
  "aggs": {
    "topics_parent": {
      "nested": {
        "path": "tags"
      },
      "aggs": {
        "topics": {
          "filter": {
            "term": {
              "tags.tag.tokenized": "race"
            }
          },
          "aggs": {
            "topics": {
              "terms": {
                "field": "tags.tag.keyword",
                "size": 100
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
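Since the search text comes from the user, the request body would typically be built in code. A minimal Python sketch, assuming the index and field names used above; the function name build_tag_suggest_query is hypothetical. Because a term query is not analyzed, the input is lowercased here to line up with the lowercased n-grams:

```python
import json


def build_tag_suggest_query(user_input: str, size: int = 100) -> dict:
    """Build the nested -> filter -> terms aggregation body for a search string."""
    # term queries skip analysis, so normalize the input ourselves
    needle = user_input.strip().lower()
    return {
        "size": 0,
        "aggs": {
            "topics_parent": {
                "nested": {"path": "tags"},
                "aggs": {
                    "topics": {
                        "filter": {"term": {"tags.tag.tokenized": needle}},
                        "aggs": {
                            "topics": {
                                "terms": {"field": "tags.tag.keyword", "size": size}
                            }
                        },
                    }
                },
            }
        },
    }


print(json.dumps(build_tag_suggest_query("Race"), indent=2))
```

With the analyzer settings shown earlier (min_gram 3, max_gram 10, letters and digits only), inputs shorter than 3 characters, longer than 10, or containing spaces would not equal any single indexed token, so the caller may want to handle those cases separately.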

yielding

{
  ...
  "topics_parent" : {
    ...
    "topics" : {
      ...
      "topics" : {
        ...
        "buckets" : [
          {
            "key" : "Race",
            "doc_count" : 2
          },
          {
            "key" : "Disgrace",
            "doc_count" : 1
          },
          {
            "key" : "Tracey Chapman",
            "doc_count" : 1
          }
        ]
      }
    }
  }
}
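On the client side, the tag suggestions sit three levels deep in the response. A small Python sketch for pulling them out, assuming the aggregation names used above and that the response has already been parsed into a dict; the function name extract_tag_buckets is hypothetical:

```python
def extract_tag_buckets(response: dict) -> list[tuple[str, int]]:
    """Return (tag, doc_count) pairs from the nested/filter/terms response."""
    buckets = response["aggregations"]["topics_parent"]["topics"]["topics"]["buckets"]
    return [(b["key"], b["doc_count"]) for b in buckets]


# Sample response trimmed to the parts used here, with the bucket values shown above
sample = {
    "aggregations": {
        "topics_parent": {
            "topics": {
                "topics": {
                    "buckets": [
                        {"key": "Race", "doc_count": 2},
                        {"key": "Disgrace", "doc_count": 1},
                        {"key": "Tracey Chapman", "doc_count": 1},
                    ]
                }
            }
        }
    }
}

print(extract_tag_buckets(sample))
```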

Caveats

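  • You'll have to reindex for this to work properly
  • n-grams increase the storage footprint; depending on how many tags you have per document, this may become a concern
  • Nested fields are internally treated as separate documents, so this affects disk space as well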

P.S.: This is an interesting use case. Let me know how the implementation went!
