"Random" sample from MongoDB returning heavily skewed results

Views: 42 | Published: 2021/4/2 19:19:39 | Tags: mongodb random aggregation-framework pymongo

Problem description

I have a collection in MongoDB with ~600,000 documents. Exactly half of them have a certain field set to 0, and the other half have the same field set to 1. When I try to get a random sample from this collection using the $sample stage of the aggregation pipeline (via PyMongo), the result skews heavily toward the value 1.

In a 25,000-record sample, there might be 300-400 records where the field is 0 and 24,000+ records where it is 1.

If the initial collection is equally distributed, why does this use of $sample return results with such a vastly different distribution, and how can I get a representative sample from a collection?

Here's the PyMongo line I'm using for the query:

cursor = foo_database.bar_collection.aggregate([{"$sample": {"size": 25000}}])

Recommended answer

As of MongoDB 3.4.9, part of the reason for the bias you've observed is that $sample relies almost entirely on the storage engine's random cursor implementation (see SERVER-19183). This was done so that $sample performs well when the collection contains a lot of data. However, because the storage engine stores documents in sorted order using a B-tree-type implementation, the random cursor cannot always produce a truly random result.
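To check whether a given sample shows this kind of bias, one option is to tally the field values in the returned documents and compare the proportions with those of the full collection. The following is a minimal sketch; the helper name value_proportions and the field name "flag" in the usage note are illustrative assumptions, since the question does not name the field.

```python
from collections import Counter


def value_proportions(documents, field):
    """Return the share of each distinct value of `field` among `documents`.

    `documents` can be any iterable of dict-like documents, for example the
    cursor returned by a PyMongo aggregate() call.
    """
    # Documents that lack the field are counted under None.
    counts = Counter(doc.get(field) for doc in documents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}
```

Calling value_proportions(cursor, "flag") on the sample from the question would report a share of roughly 0.012 to 0.016 for the value 0 (300-400 out of 25,000), against an expected 0.5.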

There are currently two open feature requests for better $sample mechanics: SERVER-22069 and SERVER-22068.

Having said that, if you require a truly unbiased sample of your data, rolling your own $sample-like solution is likely the best way to proceed at this point. Something like:

Get a list of all _id values in the collection.
Perform a random sampling on this list (e.g. using Python's random.sample, which draws without replacement).
Fetch the corresponding documents using the sampled _id values. Since _id is always indexed, this should be reasonably performant, depending on the sample size you want.
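The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a tuned implementation: the function name sample_documents, the seed parameter, and the batch size of 1000 are choices made for this example. It uses random.sample rather than random.choice because random.choice returns a single element, and it assumes the full list of _id values fits comfortably in client memory.

```python
import random


def sample_documents(collection, size, seed=None):
    """Return up to `size` documents chosen uniformly at random, without replacement.

    `collection` is expected to behave like a PyMongo collection, i.e. to
    offer find(filter, projection).
    """
    rng = random.Random(seed)

    # Step 1: fetch only the _id of every document; the projection keeps this light.
    all_ids = [doc["_id"] for doc in collection.find({}, {"_id": 1})]

    # Step 2: sample on the client side, capped at the collection size.
    chosen_ids = rng.sample(all_ids, min(size, len(all_ids)))

    # Step 3: fetch the full documents through the _id index, in batches so
    # that each $in query stays small (the batch size is an arbitrary choice).
    batch_size = 1000
    documents = []
    for start in range(0, len(chosen_ids), batch_size):
        batch = chosen_ids[start:start + batch_size]
        documents.extend(collection.find({"_id": {"$in": batch}}))
    return documents
```

With the names from the question, a call would look like sample_documents(foo_database.bar_collection, 25000). The documents come back in whatever order the server returns each batch, not in the order the _id values were drawn, so shuffle the result if order matters.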
