"Random" sample from MongoDB returning heavily skewed results

Question

I have a collection in MongoDB with ~600,000 documents. Of those, exactly half have a field set to 0, while the others have the same field set to 1. When I try to get a random sample from this collection using the sample operation in the aggregation pipeline (via PyMongo), it skews heavily toward the 1 value.

In a 25,000 record sample, there might be 300-400 records where the field is 0, and then 24,000+ records where the field in question is 1.

If the initial collection is equally distributed, why is this use of $sample returning results with such a vastly different distribution, and how can I get a representative sample from a collection?

Here's the PyMongo line I'm using for the query:

cursor = foo_database.bar_collection.aggregate([{"$sample": {"size": 25000}}])
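
One way to confirm the skew is to tally the field's values in the returned sample. This is a minimal sketch, assuming the 0/1 field is called `flag` (a hypothetical name, since the question does not give the real one); `field_distribution` is an illustrative helper, not part of PyMongo:

```python
from collections import Counter

def field_distribution(docs, field):
    """Count how often each value of `field` occurs in an iterable of documents."""
    return Counter(doc.get(field) for doc in docs)

# Stand-in documents; with PyMongo you would pass the aggregate() cursor instead.
# "flag" is a hypothetical field name.
example_docs = [{"flag": 1}, {"flag": 1}, {"flag": 0}]
print(field_distribution(example_docs, "flag"))
```

With the query above, `field_distribution(cursor, "flag")` would show how many sampled documents carry each value.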

Recommended answer

As of MongoDB 3.4.9, part of the reason for the bias you've observed is that $sample relies almost entirely on the storage engine's random cursor implementation (see SERVER-19183). This is done so that $sample can be performant when the collection contains a lot of data. However, since the storage engine stores documents in sorted order using a B-tree type implementation, it's not always possible to produce a truly random result.

There are currently two feature requests for better $sample mechanics, namely SERVER-22069 and SERVER-22068.

Having said that, if you require a truly unbiased sample of your data, rolling your own $sample-like solution is likely the best way to proceed at this point. Something like:

  1. Get a list of all _id values in the collection.
  2. Perform a random sampling on this list (e.g. using Python's random.sample, which draws without replacement; random.choice returns only a single element).
  3. Fetch the corresponding documents by their sampled _id values. This should be reasonably performant, depending on the sample size you want, since _id is always indexed.
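
The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not a tuned implementation: `sample_documents` and its `batch_size` parameter are hypothetical names, the collection object is assumed to behave like a PyMongo collection (only `find` is used), and the full list of `_id` values is assumed to fit in client memory.

```python
import random

def sample_documents(collection, size, batch_size=1000, rng=random):
    """Return up to `size` documents chosen uniformly, without replacement."""
    # Step 1: fetch only the _id of every document (the projection keeps this light).
    all_ids = [doc["_id"] for doc in collection.find({}, {"_id": 1})]

    # Step 2: sample on the client; sample() never picks the same _id twice.
    chosen = rng.sample(all_ids, min(size, len(all_ids)))

    # Step 3: fetch the chosen documents in batches, using the _id index.
    docs = []
    for start in range(0, len(chosen), batch_size):
        batch = chosen[start:start + batch_size]
        docs.extend(collection.find({"_id": {"$in": batch}}))
    return docs
```

For the collection in the question this would be called as `sample_documents(foo_database.bar_collection, 25000)`. Batching the `$in` lookups is a design choice to keep each query document small; documents come back in whatever order the server returns them, not in sampled order.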
