如何从CouchDB中加载随机文档(高效且公平)? [英] How do I load a random document from CouchDB (efficiently and fairly)?

查看:81
本文介绍了如何从CouchDB中加载随机文档(高效且公平)?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想从存储在CouchDB数据库中的一组文档中加载随机文档.提取和加载文档的方法应符合以下要求:

I would like to load a random document out of a set of documents stored in a CouchDB database. The method for picking and loading the document should conform to the following requirements:

  • 效率:文档的查找应该高效,最重要的是,加载文档的时间不能随文档总数线性增长.这意味着不能使用 skip 查询参数.

均匀分布:选择应该是真正随机的(尽可能使用标准随机数生成器),每个文档的选择机会均等.

Uniform distribution: The choice should be truly random (as far as possible, using standard random number generators), every document should have equal chances of being chosen.

在CouchDB中实现此目的的最佳方法是什么?

What is the best way to implement this in CouchDB?

推荐答案

再三考虑之后,我想出了一个解决方案.为了完整起见,我将首先展示两种简单的方法并解释为什么它们有缺陷.第三种解决方案是我要使用的解决方案.

After giving this some more thought, I came up with a solution. For completeness sake, I will first show two simple approaches and explain why they are flawed. The third solution is the one I'm going with.

这是一个简单的解决方案:您有一个简单的视图(我们称其为random),其中包含一个map函数,该函数会发出您要选择的所有文档,并且内置了_count reduce函数.要选择随机文档,请按照以下步骤操作:

This is the trivial solution: You have a simple view (let's call it random) with a map function that emits all documents you want to choose from and the built-in _count reduce function. To pick a random document, follow these steps:

  • 通过调用以下命令在视图中查找文档总数N:
    http://localhost:5984/db/_design/d/_view/random
  • 选择随机数0 <= i < N
  • 加载第i个文档:
    http://localhost:5984/db/_design/d/_view/random?reduce=false&skip=i&limit=1
  • Find the total number of documents N in the view by calling:
    http://localhost:5984/db/_design/d/_view/random
  • Pick random number 0 <= i < N
  • Load the i'th document:
    http://localhost:5984/db/_design/d/_view/random?reduce=false&skip=i&limit=1

这种方法不好,因为它不能很好地扩展大量文档.根据"CouchDB-权威指南"的这一部分,只应使用skip参数可以与一位数字值一起使用.

This approach is bad because it doesn't scale well for large numbers of documents. According to this section of "CouchDB - The Definitive Guide" the skip argument should only be used with single-digit values.

上面的解决方案必须返回i文档,然后再返回所选择的文档.用SQL术语来说,这等效于全表扫描,而不是索引查找.

The solution above would have to loop through i documents before returning the chosen one. In SQL terms it's the equivalent of a full table scan as opposed to an index lookup.

使用这种方法,在创建时会为每个文档生成一个随机数,并将其存储在文档中.示例文件:

With this approach, a random number is generated for each document at creation time and stored in the document. An example document:

{
  _id: "4f12782c39474fd0a498126c0400708c",
  rand: 0.4591819887660398,
  // actual data...
}

random视图具有以下映射功能:

The random view then has the following map function:

function(doc) {
  if (doc.rand) {
    emit(doc.rand, doc);
  }
}      

以下是选择随机文档的步骤:

These are the steps to pick a random document:

  • 选择一个随机数0 <= r < 1
  • 加载文档:
    http://localhost:5984/db/_design/d/_view/random?startkey=r&limit=1
  • 如果未返回任何文档(因为r大于数据库中存储的最大随机数),请环绕并加载第一个文档.
  • Pick a random number 0 <= r < 1
  • Load the document:
    http://localhost:5984/db/_design/d/_view/random?startkey=r&limit=1
  • If no document is returned (because r is larger than the largest random number stored in the database), wrap around and load the first document.

这非常快,一见钟情.但是,存在一个严重的缺陷:并非所有文档都具有相同的被拣选机会.

This is very fast and looks great at first sight. However, there's a serious flaw: not all documents have the same chance of being picked.

在最简单的示例中,数据库中有两个文档.当我选择随机文档的次数非常多时,我希望每个文档的出现时间都减少一半.假设在创建时为文档分配了随机数0.2和0.9.因此,当(r <= 0.2) or (r > 0.9)时选择文档A,而0.2 < r <= 0.9时选择文档B.不是每个文档被拣选的机率不是50%,而是A拣选的机率是30%,B拣选的机率是70%.

In the most simple example, there are two documents in the database. When I choose a random document a very large number of times, I want each document to come up half of the time. Let's say the documents were assigned the random numbers 0.2 and 0.9 at creation time. So document A is picked when (r <= 0.2) or (r > 0.9) and document B is chosen when 0.2 < r <= 0.9. The chance of being picked is not 50% for each document, but 30% for A and 70% for B.

当数据库中有更多文档时,您可能会认为情况有所改善,但实际上并没有.文档之间的间隔变得越来越小,但是间隔大小的变化甚至变得更糟:想象一下三个文档A,B和C,它们的随机数分别为0.30001057、0.30002057和0.30002058(中间没有其他文档).选择B的机会比选择C的机会大1000倍.在最坏的情况下,两个文档被分配相同的随机数.然后只能找到其中一个(一个具有较低文档ID的文件),另一个实际上是不可见的.

You might think the situation improves when there are more documents in the database, but it really doesn't. The intervals between documents get smaller, but the variation in interval size get's even worse: Imagine three documents A, B and C with the random numbers 0.30001057, 0.30002057 and 0.30002058 (no other documents are in between). The chances of B being chosen are 1000 times greater than C being chosen. In the worst case, two documents are assigned the same random number. Then only one of them can be found (the one with the lower document id), the other is essentially invisible.

我想出的解决方案将方法2的速度与方法1的公平性结合在一起.这里是:

The solution I came up with combines the speed of approach 2 with the fairness of approach 1. Here it is:

与方法2一样,每个文档在创建时都分配有一个随机数,该视图使用相同的地图功能.与方法1一样,我也有一个_count reduce函数.

As in approach 2, each document is assigned a random number at creation time, the same map function is used for the view. As in approach 1, I also have a _count reduce function.

以下是加载随机文档的步骤:

These are the steps for loading a random document:

  • 通过调用以下内容在视图中查找文档总数N:
    http://localhost:5984/db/_design/d/_view/random
  • 选择随机数0 <= r < 1
  • 计算随机索引:i = floor(r*N)
    我的目标是加载第i个文档(与方法1一样).假设随机数的分布大致相同,我猜第i个文档的随机值大约为r.
  • 查找随机值小于r的文档L的数量: http://localhost:5984/db/_design/d/_view/random?endkey=r
  • 看看我们的猜测还有多远:s = i - L
  • if (s>=0)
    http://localhost:5984/db/_design/d/_view/random?startkey=r&skip=s&limit=1&reduce=false
  • if (s<0)
    http://localhost:5984/db/_design/d/_view/random?startkey=r&skip=-(s+1)&limit=1&descending=true&reduce=false
  • Find the total number of documents N in the view by calling:
    http://localhost:5984/db/_design/d/_view/random
  • Pick random number 0 <= r < 1
  • Calculate random index: i = floor(r*N)
    My goal is to load the i'th document (as in approach 1). Assuming the distribution of random numbers is more or less uniform, I'm guessing the i'th document has a random value of approximately r.
  • Find the number of documents L with a random value lower than r: http://localhost:5984/db/_design/d/_view/random?endkey=r
  • See how far off our guess is: s = i - L
  • if (s>=0)
    http://localhost:5984/db/_design/d/_view/random?startkey=r&skip=s&limit=1&reduce=false
  • if (s<0)
    http://localhost:5984/db/_design/d/_view/random?startkey=r&skip=-(s+1)&limit=1&descending=true&reduce=false

因此,诀窍是猜测分配给第i个文档的随机数,查找该数字,查看我们离我们有多远,然后跳过丢失的文档数.

So, the trick is to guess the random number assigned to the i'th document, look that up, see how far we're off and then skip the number of documents by which we missed.

即使对于大型数据库,跳过的文档数也应保持较小,因为猜测的准确性将随着文档数的增加而增加.我的猜测是s在数据库增长时保持不变,但是我没有尝试过,而且我不具备从理论上证明它的资格.

The number of documents skipped should remain small even for large databases, since the accuracy of the guess will increase with the number of documents. My guess is that s remains constant when the database grows, but I haven't tried and I don't feel qualified to prove it theoretically.

如果您有更好的解决方案,我将非常有兴趣!

If you have a better solution, I'd be very interested!

这篇关于如何从CouchDB中加载随机文档(高效且公平)?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆