Loop through ElasticSearch documents source array in Painless

Question

I have the following ElasticSearch data structure for products in a webshop:

{
  "_index": "vue_storefront_catalog_1_product_1617378559",
  "_type": "_doc",
  "_source": {
    "configurable_children": [
      {
        "price": 49.99,
        "special_price": 34.99,
        "special_from_date": "2020-11-27 00:00:00",
        "special_to_date": "2020-11-30 23:59:59",
        "stock": {
          "qty": 0,
          "is_in_stock": false,
          "stock_status": 0
        }
      },
      {
        "price": 49.99,
        "special_price": null,
        "special_from_date": null,
        "special_to_date": null,
        "stock": {
          "qty": 0,
          "is_in_stock": false,
          "stock_status": 0
        }
      }
    ]
  }
}

With the following mapping:

{
  "vue_storefront_catalog_1_product_1614928276" : {
    "mappings" : {
      "properties" : {
        "configurable_children" : {
          "properties" : {
            "price" : {
              "type" : "double"
            },
            "special_from_date" : {
              "type" : "date",
              "format" : "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            },
            "special_price" : {
              "type" : "double"
            },
            "special_to_date" : {
              "type" : "date",
              "format" : "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            }
          }
        }
      }
    }
  }
}

I have created an Elasticsearch query to filter out only products that are on sale, meaning: the special_price must be lower than the price, and the current date must be between special_from_date and special_to_date.

This is the Painless script I have created:

  boolean hasSale = false;

  long timestampNow = new Date().getTime();
  if (doc.containsKey('configurable_children.special_from_date') && !doc['configurable_children.special_from_date'].empty) {
    long timestampSpecialFromDate = doc['configurable_children.special_from_date'].value.toInstant().toEpochMilli();
    if (timestampSpecialFromDate > timestampNow) {
      hasSale = false;
    }
  } else if (doc.containsKey('configurable_children.special_to_date') && !doc['configurable_children.special_to_date'].empty) {
    long timestampSpecialToDate = doc['configurable_children.special_to_date'].value.toInstant().toEpochMilli();
    if (timestampSpecialToDate < timestampNow) {
      hasSale = false;
    }
  } else if (doc.containsKey('configurable_children.stock.is_in_stock') && doc['configurable_children.stock.is_in_stock'].value == false) {
      hasSale = false;
  } else if (1 - (doc['configurable_children.special_price'].value / doc['configurable_children.price'].value) > params.fraction) {
    hasSale = true;
  }

  return hasSale

This returns the product as soon as one of the configurable_children has met the criteria to be a sale product. This is incorrect, because I need to loop through the whole set of configurable_children to determine if it's a sale product. How can I make sure all children are taken into the calculation? With a loop?

Here is the new query as suggested by Joe in the answers:

GET vue_storefront_catalog_1_product/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "source": """
                int allEntriesAreTrue(def arrayList) {
                  return arrayList.stream().allMatch(Boolean::valueOf) == true ? 1 : 0
                } 
                
                ArrayList childrenAreMatching = [];
                
                long timestampNow = params.timestampNow;
                
                ArrayList children = params._source['configurable_children'];
                
                if (children == null || children.size() == 0) {
                  return allEntriesAreTrue(childrenAreMatching);
                }
                
                for (config in children) {
                  if (!config.containsKey('stock')) {
                    childrenAreMatching.add(false);
                    continue;
                  } else if (!config['stock']['is_in_stock']
                      || config['special_price'] == null
                      || config['special_from_date'] == null 
                      || config['special_to_date'] == null) {
                    childrenAreMatching.add(false);
                    continue;
                  } 
                  
                  if (config['special_from_date'] != null && config['special_to_date'] != null) {
                    SimpleDateFormat sf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                    def from_millis = sf.parse(config['special_from_date']).getTime();
                    def to_millis = sf.parse(config['special_to_date']).getTime();
                    
                    if (!(timestampNow >= from_millis && timestampNow <= to_millis)) {
                      childrenAreMatching.add(false);
                      continue;
                    }
                  }
                  
                  def sale_fraction = 1 - (config['special_price'] / config['price']);
                  if (sale_fraction <= params.fraction) {
                    childrenAreMatching.add(false);
                    continue;
                  }
                  
                  childrenAreMatching.add(true);
                }
                return allEntriesAreTrue(childrenAreMatching);
              """,
              "params": {
                "timestampNow": 1617393889567,
                "fraction": 0.1
              }
            }
          }
        }
      ],
      "min_score": 1
    }
  }
}

The response is as follows:

{
  "took" : 15155,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2936,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [... hits here ...]
  }
}

Any idea why the query takes around 15 seconds?

Recommended answer

Your intuition is right: you'll need a for loop if you want to check all of the array list objects.

Now, before I jump onto the iteration aspect, there's one important thing to know about arrays in Elasticsearch. When they're not defined as nested, their content is flattened and the relationships between the individual key/value pairs are lost. As such, you should definitely adjust your mapping like so:

{
  "vue_storefront_catalog_1_product_1614928276" : {
    "mappings" : {
      "properties" : {
        "configurable_children" : {
          "type": "nested",        <---
          "properties" : {
            "price" : {
              "type" : "double"
            },
            ...
          }
        }
      }
    }
  }
}

and reindex your data to ensure that the configurable_children are treated as separate, standalone entities.
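To make the flattening problem concrete, here is a minimal plain-Python model of what happens to a non-nested object array. It is only an illustration, not Elasticsearch code: the `flatten` helper and the sample prices are made up for this sketch.

```python
# Plain-Python model of how a regular (non-nested) object array is indexed.
# Illustration only: `flatten` and the sample values are hypothetical.

children = [
    {"price": 49.99, "special_price": 34.99},
    {"price": 89.99, "special_price": None},
]


def flatten(field_name, objects):
    """Collect every leaf field into one multi-valued list per field path.

    Null values are skipped, since nulls are not indexed.
    """
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            if value is not None:
                flat.setdefault(f"{field_name}.{key}", []).append(value)
    return flat


flat_doc = flatten("configurable_children", children)
print(flat_doc)
```

After flattening there are two prices but only one special price, and nothing records which child the special price came from. That lost pairing is what the nested type preserves.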

As soon as they're mapped as nested, you'll be able to retrieve just those children that match your scripted condition:

POST vue_storefront_catalog_1_product_1614928276/_search
{
  "_source": "configurable_children_that_match", 
  "query": {
    "nested": {
      "path": "configurable_children",
      "inner_hits": {
        "name": "configurable_children_that_match"
      }, 
      "query": {
        "bool": {
          "must": [
            {
              "script": {
                "script": {
                  "source": """
                    boolean hasSale = false;
                    
                    long timestampNow = new Date().getTime();
                    
                    if (doc.containsKey('configurable_children.special_from_date') && !doc['configurable_children.special_from_date'].empty) {
                      long timestampSpecialFromDate = doc['configurable_children.special_from_date'].value.toInstant().toEpochMilli();
                      if (timestampSpecialFromDate > timestampNow) {
                       return false
                      }
                    } 
                    
                    if (doc.containsKey('configurable_children.special_to_date') && !doc['configurable_children.special_to_date'].empty) {
                      long timestampSpecialToDate = doc['configurable_children.special_to_date'].value.toInstant().toEpochMilli();
                      if (timestampSpecialToDate < timestampNow) {
                        return false
                      }
                    }
                    
                    if (doc.containsKey('configurable_children.stock.is_in_stock') && doc['configurable_children.stock.is_in_stock'].value == false) {
                        return false
                    }
                    
                    if (1 - (doc['configurable_children.special_price'].value / doc['configurable_children.price'].value) > params.fraction) {
                      hasSale = true;
                    }
                    
                    return hasSale
                  """,
                  "params": {
                    "fraction": 0.1
                  }
                }
              }
            }
          ]
        }
      }
    }
  }
}

Two things to note here:

  1. The inner_hits attribute of a nested query lets Elasticsearch know that you're only interested in those children that truly matched. Otherwise, all configurable_children would be returned. When its name is specified in the _source parameter, the original, full JSON document source is skipped and only the named inner_hits are returned.
  2. Due to the distributed nature of ES, it's not recommended to use Java's new Date(). I've explained the reasoning behind it in my answer to How to get current time as unix timestamp for script use. You'll see me use a parametrized now in the query at the bottom of this answer.
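As a rough sketch of the parametrized approach, the client can compute the timestamp once and send it along as a script parameter. The helper names below (`to_epoch_millis`, `build_script_params`) are hypothetical; only the `timestampNow` and `fraction` parameter names come from the queries in this post.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    # Integer division of two timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def build_script_params(fraction, moment=None):
    """Build the script params; the timestamp is fixed once, on the client."""
    if moment is None:
        moment = datetime.now(timezone.utc)
    return {"timestampNow": to_epoch_millis(moment), "fraction": fraction}


fixed_moment = datetime(2021, 4, 2, 20, 4, 49, 567000, tzinfo=timezone.utc)
params = build_script_params(0.1, fixed_moment)
print(params)
```

Because the value is computed before the request is sent, every shard evaluates the script against the same instant.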

Moving on, it's important to mention that nested objects are represented internally as separate subdocuments.

A side effect of this fact is that once you're inside a nested query's context, you don't have access to other nested children of the very same document.

In order to mitigate this, it's customary to keep the nested children in sync with the top level, such that when you flatten one of the objects' attributes for use on the top level, you can simply iterate over the respective doc values. This flattening is usually done through the copy_to feature, which I illustrated in my answer to How to iterate through a nested array in elasticsearch with filter script?

In your particular use case, this would mean that you'd, for instance, use copy_to on the field stock.is_in_stock, which would result in a top-level boolean array list that is easier to work with than an array list of objects.
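The effect can be sketched in plain Python. This is a model of the idea only, not of the Elasticsearch internals: the `copy_to_top_level` helper and the `children_in_stock` field name are made up, and in Elasticsearch the copied values are indexed without being added to _source (they are placed on the dict here purely for illustration).

```python
def copy_to_top_level(doc, source_path, target_field):
    """Gather one attribute of every child into a top-level multi-valued field.

    `target_field` is a hypothetical name chosen for this sketch.
    """
    values = []
    for child in doc.get("configurable_children", []):
        value = child
        for key in source_path:
            value = value.get(key) if isinstance(value, dict) else None
        if value is not None:
            values.append(value)
    flattened = dict(doc)
    flattened[target_field] = values
    return flattened


product = {
    "configurable_children": [
        {"price": 49.99, "stock": {"is_in_stock": False}},
        {"price": 49.99, "stock": {"is_in_stock": True}},
    ]
}

flattened = copy_to_top_level(product, ["stock", "is_in_stock"], "children_in_stock")
all_in_stock = all(flattened["children_in_stock"])
print(flattened["children_in_stock"], all_in_stock)
```

A flat list of booleans like this can be checked in one pass, which is the convenience the paragraph above refers to.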

So far so good, but you'd still be missing a way to filter based on the special dates (special_from_date and special_to_date).

Now, regardless of whether you're dealing with nested or regular object field types, accessing params._source in regular script queries hasn't worked in ES since v6.4. It can still be read inside a script_score, though, which is what the function_score query further down relies on.

As stated in your question, you

..need to loop through the whole set of configurable_children to determine if it's a sale product..

With that being said, here's how my query below works:

  1. The function_score query typically generates a custom calculated score, but it can, with the help of min_score, be used as a boolean yes/no filter to exclude docs whose configurable_children do not fulfil a certain condition.
  2. As the configurable_children are being iterated, each loop appends a boolean to childrenAreMatching, which is then passed to the allEntriesAreTrue helper. The helper returns 1 if all entries are true and 0 if not.
  3. The dates are parsed and compared with the parametrized now; the fraction is compared too. If, at any point, some condition fails, false is recorded for that child and the loop jumps to the next iteration.

POST vue_storefront_catalog_1_product_1614928276/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "source": """
                // casting helper
                int allEntriesAreTrue(def arrayList) {
                  return arrayList.stream().allMatch(Boolean::valueOf) == true ? 1 : 0
                } 
                
                ArrayList childrenAreMatching = [];
                
                long timestampNow = params.timestampNow;
                
                ArrayList children = params._source['configurable_children'];
                
                if (children == null || children.size() == 0) {
                  return allEntriesAreTrue(childrenAreMatching);
                }
                
                for (config in children) {
                  if (!config['stock']['is_in_stock']
                      || config['special_price'] == null
                      || config['special_from_date'] == null 
                      || config['special_to_date'] == null) {
                    // nothing to do here...
                    childrenAreMatching.add(false);
                    continue;
                  } 
                  
                  SimpleDateFormat sf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                  def from_millis = sf.parse(config['special_from_date']).getTime();
                  def to_millis = sf.parse(config['special_to_date']).getTime();
                  
                  if (!(timestampNow >= from_millis && timestampNow <= to_millis)) {
                    // not in date range
                    childrenAreMatching.add(false);
                    continue;
                  }
                  
                  def sale_fraction = 1 - (config['special_price'] / config['price']);
                  if (sale_fraction <= params.fraction) {
                    // fraction condition not met
                    childrenAreMatching.add(false);
                    continue;
                  }
                  
                  childrenAreMatching.add(true);
                }
                
                // need to return a number because it's a script score query
                return allEntriesAreTrue(childrenAreMatching);
              """,
              "params": {
                "timestampNow": 1617393889567,
                "fraction": 0.1
              }
            }
          }
        }
      ],
      "min_score": 1
    }
  }
}

All in all, only those documents whose configurable_children all fulfil the specified conditions would be returned.
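For checking the expected outcome against sample documents outside Elasticsearch, the per-child logic of the script can be mirrored in plain Python. This is a sketch under stated assumptions, not the Painless script itself: the function names are made up, date strings are treated as UTC (Java's SimpleDateFormat would use the JVM's default time zone instead), and a child without a stock object is simply treated as not matching.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_millis(text):
    """Parse 'yyyy-MM-dd HH:mm:ss' into epoch milliseconds (UTC is an assumption)."""
    parsed = datetime.strptime(text, DATE_FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp()) * 1000


def child_matches(child, now_millis, fraction):
    """Same checks as the loop body: stock, nulls, date range, discount fraction."""
    stock = child.get("stock") or {}
    if not stock.get("is_in_stock"):
        return False
    required = ("special_price", "special_from_date", "special_to_date")
    if any(child.get(key) is None for key in required):
        return False
    start = to_millis(child["special_from_date"])
    end = to_millis(child["special_to_date"])
    if not (start <= now_millis <= end):
        return False
    return 1 - child["special_price"] / child["price"] > fraction


def score(source, now_millis, fraction):
    """Return 1 when every child matches, else 0, like allEntriesAreTrue."""
    children = source.get("configurable_children") or []
    return 1 if all(child_matches(c, now_millis, fraction) for c in children) else 0


on_sale = {
    "price": 49.99,
    "special_price": 34.99,
    "special_from_date": "2020-11-27 00:00:00",
    "special_to_date": "2020-11-30 23:59:59",
    "stock": {"is_in_stock": True},
}
no_special = {"price": 49.99, "special_price": None, "stock": {"is_in_stock": True}}
during_sale = to_millis("2020-11-28 12:00:00")

print(score({"configurable_children": [on_sale]}, during_sale, 0.1))
print(score({"configurable_children": [on_sale, no_special]}, during_sale, 0.1))
```

One detail worth knowing: Python's all() on an empty list is true, and Java's Stream.allMatch behaves the same way on an empty stream, so a document with no configurable_children at all scores 1 in this sketch and, by the same reasoning, in the script above. If such products should be excluded, the empty case would need to return 0 explicitly.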

P.S. If you learned something from this answer and want to learn more, I dedicated a whole chapter to ES scripts in my Elasticsearch Handbook.
