Simple Random Samples from a SQL database

Question

How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000.

The "obvious" answer is:

SELECT * FROM table ORDER BY RAND() LIMIT 10000

For large tables, that's too slow: it calls RAND() for every row (which already puts it at O(n)), and sorts them, making it O(n lg n) at best. Is there a way to do this faster than O(n)?

Note: As Andrew Mao points out in the comments, if you're using this approach on SQL Server, you should use the T-SQL function NEWID(), because RAND() may return the same value for all rows.

5 years later

I ran into this problem again with a bigger table, and ended up using a version of @ignorant's solution, with two tweaks:

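  • Sample the rows down to 2-5x my desired sample size, so that the final ORDER BY RAND() is cheap
  • Save the result of RAND() to an indexed column on every insert/update. (If your data set isn't very update-heavy, you may need to find another way to keep this column fresh.)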

To take a 1000-item sample of a table, I count the rows and sample the result down to, on average, 10,000 rows with the frozen_rand column:

SELECT COUNT(*) FROM table; -- Use this to determine rand_low and rand_high

  SELECT *
    FROM table
   WHERE frozen_rand BETWEEN %(rand_low)s AND %(rand_high)s
ORDER BY RAND() LIMIT 1000

(My actual implementation involves more work to make sure I don't undersample, and to manually wrap rand_high around, but the basic idea is "randomly cut your N down to a few thousand.")

While this makes some sacrifices, it allows me to sample the database down using an index scan, until it's small enough to ORDER BY RAND() again.
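
The query above leaves open how rand_low and rand_high are derived from the row count. The following is only a minimal sketch of one way to do it, not the original implementation: it assumes frozen_rand holds uniform values in [0, 1) (which is what MySQL's RAND() returns), and the function name frozen_rand_windows is hypothetical. It picks a random starting point and wraps around past 1.0 when needed.

```python
import random

def frozen_rand_windows(row_count, target_rows, rng=random):
    """Return (low, high) ranges over [0, 1) expected to cover about target_rows rows.

    Assumes frozen_rand is uniform in [0, 1) and row_count > 0.
    """
    # Fraction of the table we want the index scan to return on average.
    width = min(1.0, target_rows / row_count)
    low = rng.random()
    high = low + width
    if high <= 1.0:
        return [(low, high)]
    # The window runs past 1.0, so wrap the overflow around to the start.
    return [(low, 1.0), (0.0, high - 1.0)]

# Example: aim for about 10,000 rows out of 200,000.
windows = frozen_rand_windows(200_000, 10_000)
```

Each (low, high) pair would be bound to rand_low and rand_high. When two windows come back, the WHERE clause would need two BETWEEN conditions joined with OR.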

Recommended answer

There's a very interesting discussion of this type of issue here: http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

I think that, with absolutely no assumptions about the table, your O(n lg n) solution is the best. Though actually, with a good optimizer or a slightly different technique, the query you list may be a bit better, O(m*n) where m is the number of random rows desired, as it wouldn't necessarily have to sort the whole large array; it could just search for the smallest m times. But for the sort of numbers you posted (m = 10,000, while lg n is only about 18 for n = 200,000), m is bigger than lg n anyway.
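
As an illustration of avoiding the full sort, here is a minimal sketch done outside the database. Note that it swaps in a different technique from the m-pass search described above: a bounded heap, which costs roughly O(n log m). It is not a claim about what MySQL's optimizer does, and the function name sample_by_smallest_keys is hypothetical.

```python
import heapq
import random

def sample_by_smallest_keys(rows, m, rng=random):
    """Keep the m rows with the smallest random keys, without sorting all n rows."""
    # The key function is evaluated once per row, so each row gets one random key,
    # just as ORDER BY RAND() LIMIT m would assign one RAND() value per row.
    return heapq.nsmallest(m, rows, key=lambda _row: rng.random())

# Example: pick 10 "rows" out of 1,000 stand-in row ids.
picked = sample_by_smallest_keys(range(1, 1001), 10)
```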

Three assumptions we might try out:

  1. there is a unique, indexed, primary key in the table
  2. the number of random rows you want to select (m) is much smaller than the number of rows in the table (n)
  3. the unique primary key is an integer that ranges from 1 to n with no gaps

With only assumptions 1 and 2, I think this can be done in O(n), though you'll need to write a whole index to the table to match assumption 3, so it's not necessarily a fast O(n). If we can additionally assume something else nice about the table, we can do the task in O(m log m). Assumption 3 would be an easy, nice additional property to work with. With a nice random number generator that guaranteed no duplicates when generating m numbers in a row, an O(m) solution would be possible.

Given the three assumptions, the basic idea is to generate m unique random numbers between 1 and n, and then select the rows with those keys from the table. I don't have MySQL or anything in front of me right now, so in slight pseudocode this would look something like:


create table RandomKeys (RandomKey int)
create table RandomKeysAttempt (RandomKey int)

-- generate m random keys between 1 and n;
-- floor() keeps each key an integer in 1..n with equal probability
for i = 1 to m
  insert RandomKeysAttempt select floor(rand()*n) + 1

-- eliminate duplicates
insert RandomKeys select distinct RandomKey from RandomKeysAttempt

-- as long as we don't have enough, keep generating new keys,
-- with luck (and m much less than n), this won't be necessary
while (select count(*) from RandomKeys) < m
  NextAttempt = floor(rand()*n) + 1
  if not exists (select * from RandomKeys where RandomKey = NextAttempt)
    insert RandomKeys select NextAttempt

-- get our random rows
select *
from RandomKeys r
join table t ON r.RandomKey = t.UniqueKey
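
How often the retry loop is needed depends on how many of the first m draws collide. A quick way to estimate this is sketched below. It assumes the keys are drawn uniformly and independently from 1..n, and the function name expected_duplicates is hypothetical. For the numbers in the question (m = 10,000 and n = 200,000) the estimate comes out at about 250 repeated keys, so the loop should be expected to run at least a little.

```python
def expected_duplicates(m, n):
    """Expected number of repeated keys when drawing m keys uniformly from 1..n with replacement."""
    # Each particular key is missed by all m draws with probability (1 - 1/n) ** m,
    # so the expected number of distinct keys seen is n * (1 - (1 - 1/n) ** m).
    expected_distinct = n * (1.0 - (1.0 - 1.0 / n) ** m)
    return m - expected_distinct

estimate = expected_duplicates(10_000, 200_000)
```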

If you were really concerned about efficiency, you might consider doing the random key generation in some sort of procedural language and inserting the results in the database, as almost anything other than SQL would probably be better at the sort of looping and random number generation required.
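
As a rough sketch of that suggestion, the key generation could be done in Python, where random.sample returns m distinct values directly, so no duplicate-elimination loop is needed. Everything specific here is an assumption for illustration: the table name items, its id column, and the helper sample_rows_by_key are hypothetical, and the standard-library sqlite3 module stands in for MySQL only so the example is self-contained. It relies on all three assumptions above holding.

```python
import random
import sqlite3

def sample_rows_by_key(conn, n, m, rng=random):
    """Pick m distinct keys in 1..n client-side, then fetch those rows by primary key."""
    # random.sample never repeats a value, so the keys are already unique.
    keys = rng.sample(range(1, n + 1), m)
    conn.execute("CREATE TEMP TABLE random_keys (random_key INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO random_keys VALUES (?)", ((k,) for k in keys))
    # Join on the primary key so each lookup can use the index.
    rows = conn.execute(
        "SELECT t.* FROM random_keys r JOIN items t ON r.random_key = t.id"
    ).fetchall()
    conn.execute("DROP TABLE random_keys")
    return rows

# Demo on a throwaway in-memory table with gap-free ids 1..2000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)", ((i, f"row {i}") for i in range(1, 2001))
)
sample = sample_rows_by_key(conn, n=2000, m=100)
```

Against MySQL the same idea would go through a MySQL client library instead of sqlite3, with that driver's own parameter placeholder style.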
