"Random" sample from MongoDB returning heavily skewed results
Question
I have a collection in MongoDB with ~600,000 documents. Of those, exactly half have a field set to 0, while the others have the same field set to 1. When I try to get a random sample from this collection using the sample operation in the aggregation pipeline (via PyMongo), it skews heavily toward the 1 value.
In a 25,000 record sample, there might be 300-400 records where the field is 0, and then 24,000+ records where the field in question is 1.
If the initial collection is equally distributed, why is this use of $sample returning results with such a vastly different distribution, and how can I get a representative sample from a collection?
Here's the PyMongo line I'm using for the query:
cursor = foo_database.bar_collection.aggregate( [ { "$sample": { "size": 25000} } ])
Recommended answer
As of MongoDB 3.4.9, part of the reason for the bias you've observed is that $sample relies almost entirely on the storage engine's random cursor implementation (see SERVER-19183). This is done so that $sample performs well when the collection contains a lot of data. However, since the storage engine stores documents in sorted order using a B-tree type implementation, it's not always possible to produce a truly random result.
There are currently two feature requests for better $sample mechanics, namely SERVER-22069 and SERVER-22068.
Having said that, if you require a truly unbiased sample of your data, rolling your own $sample-like solution is likely the best way to proceed at this point. Something like:
- Get a list of all _id values in the collection.
- Perform a random sampling on this list (e.g. using Python's random.sample, which draws distinct elements).
- Obtain all the relevant documents using the sampled _id values, which will be reasonably performant depending on the sample size you want, since _id is always indexed.
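The steps above could be sketched with PyMongo roughly as follows. This is a minimal sketch, not a drop-in replacement for $sample: the function name sample_documents and its seed parameter are hypothetical, and it assumes that the full list of _id values fits in memory and that the sampled ids fit in a single $in query.

```python
import random


def sample_documents(collection, size, seed=None):
    """Return up to `size` documents chosen uniformly at random, without replacement.

    `collection` is assumed to be a PyMongo Collection (anything with a
    compatible find() method works). The name and signature are illustrative.
    """
    rng = random.Random(seed)

    # Step 1: fetch only the _id of every document, not the full documents.
    all_ids = [doc["_id"] for doc in collection.find({}, {"_id": 1})]

    # Step 2: draw distinct ids; cap the request at the collection size.
    sampled_ids = rng.sample(all_ids, min(size, len(all_ids)))

    # Step 3: fetch the chosen documents through the _id index.
    return list(collection.find({"_id": {"$in": sampled_ids}}))
```

With the collection from the question, this would be called as sample_documents(foo_database.bar_collection, 25000). The documents come back in whatever order the server returns them, not in the order the ids were drawn, so shuffle the result if order matters.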