Get the latest document version and aggregate the results


Problem description

My index contains a lot of documents, and each of them has several versions. For example:

{"doc_id": 13,
"version": 1,
"text": "bar"}

{"doc_id": 13,
"version": 2,
"text": "bar"}

{"doc_id": 13,
"version": 3,
"text": "bar"}

{"doc_id": 14,
"version": 1,
"text": "foo"}

{"doc_id": 14,
"version": 2,
"text": "bar"}

I want to get the last version of each document and then aggregate those last versions using a terms aggregation.
I've tried to use top hits to retrieve the last versions:

{"size" :0,
"aggs" : {
    "doc_id_groups" : {
        "terms" : {
            "field" : "doc_id",
            "size" : "0"
        },
        "aggs" : {
            "docs" : {
                "top_hits" : {
                    "size" : 1,
                    "sort" : {
                        "version" : {
                            "order" : "desc"
                        }
                    }
                }
            }
        }
    }
}
}

But I can't aggregate the results, because top hits doesn't support sub-aggregations.
I guess retrieving the IDs and then aggregating them on the client would be a very heavy operation.
Maybe scripting could help?

Update: one thing I forgot to mention. Before aggregating, the documents are filtered by a time range, so we don't know which version is the latest at index time, only at search time.

Recommended answer

From the provided samples and the additional details in chat, I do not think you can achieve the required result using aggregations. But I can propose an alternative solution instead:


  1. Add a Boolean property "current" that is set to true for the latest version of each document. When a new version is inserted, set "current" to false on the older version and to true on the newer one.
  2. Add a property "timepoints" that holds multiple values. At the end of each day (any other period can be used), append the current timestamp (or any other identifier of the period, e.g. "09.30.2016" or "Jan") to the "timepoints" array of every current record.
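
The two steps above can be sketched in plain Python against an in-memory list that stands in for the index. This is only an illustration under assumptions: the helper names `insert_version` and `snapshot` are hypothetical, versions are assumed to arrive in increasing order, and a real setup would issue the equivalent update requests against the search engine instead.

```python
from datetime import date

def insert_version(index, doc_id, version, text):
    """Step 1: add a new version and clear "current" on the older ones."""
    for doc in index:
        if doc["doc_id"] == doc_id and doc["current"]:
            doc["current"] = False
    index.append({"doc_id": doc_id, "version": version, "text": text,
                  "current": True, "timepoints": []})

def snapshot(index, timepoint):
    """Step 2: stamp every current record with the period identifier."""
    for doc in index:
        if doc["current"] and timepoint not in doc["timepoints"]:
            doc["timepoints"].append(timepoint)

index = []  # in-memory stand-in for the real index (assumption)
insert_version(index, 13, 1, "bar")
insert_version(index, 14, 1, "foo")
snapshot(index, date(2016, 9, 29).isoformat())  # end of day 1
insert_version(index, 14, 2, "bar")
snapshot(index, date(2016, 9, 30).isoformat())  # end of day 2

for doc in index:
    print(doc)
```

After the second snapshot, version 1 of document 14 keeps only the first time point, while version 2 carries only the second one, so each time point identifies the versions that were current on that day.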

Pros


  • You can easily retrieve the records that were current at some point in time by checking whether that time point is in the "timepoints" array.

  • You can retrieve all the available time points from all the documents with a single query.

  • You can aggregate by time point, e.g. to count all the records at every point in time.

  • There is no need to maintain multiple indices, duplicates of the records, etc.; the algorithm is pretty straightforward.
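
As a rough sketch of the first and third points, the search bodies could look like the dictionaries built below. The helper names are hypothetical, and the sketch assumes that "timepoints" is indexed as an exact-value (non-analyzed) field and that the field passed as `agg_field` can be used in a terms aggregation; adjust both to your mapping.

```python
import json

def latest_versions_at(timepoint, agg_field="text"):
    """Keep only records stamped with `timepoint`, then bucket them by a field."""
    return {
        "size": 0,
        "query": {"term": {"timepoints": timepoint}},
        "aggs": {"by_value": {"terms": {"field": agg_field}}},
    }

def records_per_timepoint():
    """Count records at every time point: one bucket per value in "timepoints"."""
    return {
        "size": 0,
        "aggs": {"per_timepoint": {"terms": {"field": "timepoints"}}},
    }

print(json.dumps(latest_versions_at("2016-09-30"), indent=2))
print(json.dumps(records_per_timepoint(), indent=2))
```

Because the term filter selects exactly the versions that were current at the chosen time point, the terms aggregation runs as an ordinary top-level aggregation and no sub-aggregation under top hits is needed.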

Cons


  • It is not possible to get the current versions at an arbitrary point in time, only at the points when the calculation was performed.

  • The overall size of the "timepoints" arrays may increase significantly if you run the calculation too often and you have millions of records.

Workarounds


  • For more fine-grained statistics, run the calculation on an hourly basis. But once a day (or month, or year), remove some of the time points from the "timepoints" array for older periods. In the end you will have one time point per year (for periods more than a year ago), one per month (for periods more than a month ago), one per day (for periods more than a day ago), and one per hour for the latest period. Of course, the algorithm for removing time points can be refined according to your needs.

  • If you mostly work with the latest versions of the records, store them in a separate index and store the older versions in another one. In this case you don't even need the "current" property; just run through all the records in your current index and add the timestamp.
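
The thinning described in the first workaround could be sketched as follows. The function name `thin_timepoints`, the 30-day and 365-day thresholds, and the choice to keep the latest time point in each bucket are all assumptions made for illustration, not part of the original answer.

```python
from datetime import datetime, timedelta

def thin_timepoints(timepoints, now):
    """Keep one time point per hour, day, month, or year, depending on its age."""
    kept = {}
    for tp in sorted(timepoints):
        age = now - tp
        if age > timedelta(days=365):
            key = (tp.year,)                          # one per year
        elif age > timedelta(days=30):
            key = (tp.year, tp.month)                 # one per month
        elif age > timedelta(days=1):
            key = (tp.year, tp.month, tp.day)         # one per day
        else:
            key = (tp.year, tp.month, tp.day, tp.hour)  # one per hour
        kept[key] = tp  # a later time point replaces an earlier one in the same bucket
    return sorted(kept.values())

now = datetime(2016, 9, 30, 12)
points = [
    datetime(2014, 3, 1), datetime(2014, 8, 1),            # older than a year
    datetime(2016, 7, 1), datetime(2016, 7, 15),           # older than a month
    datetime(2016, 9, 28, 3), datetime(2016, 9, 28, 20),   # older than a day
    datetime(2016, 9, 30, 10), datetime(2016, 9, 30, 11),  # latest period
]
thinned = thin_timepoints(points, now)
for tp in thinned:
    print(tp.isoformat())
```

In this run the eight time points shrink to five: one for 2014, one for July 2016, one for 28 September, and both hourly points from the last day.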

If needed, I can provide all the queries required for the steps above.
