SQL random sample with groups

Problem description

I have a university graduate database and would like to extract a random sample of around 1,000 records.

I want the sample to be representative of the population, so it should include each course in the same proportion as in the full data set.

I could do this using the following:

select id from (select top 500 id from degree where coursecode = 1 order by newid()) c1
union
select id from (select top 300 id from degree where coursecode = 2 order by newid()) c2
union
select id from (select top 200 id from degree where coursecode = 3 order by newid()) c3

However, we have hundreds of course codes, so this would be time-consuming. I would also like to reuse the code for different sample sizes, and I would rather not go through the query hard-coding each sample size.

Any help would be greatly appreciated.

Solution

You want a stratified sample. I would recommend sorting the data by course code and taking every nth row. Here is one method, which works best when the population is large:

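select d.*
from (select d.*,
             -- number the rows by course, in random order within each course
             row_number() over (order by coursecode, newid()) as seqnum,
             count(*) over () as cnt  -- total population size
      from degree d
     ) d
where seqnum % (cnt / 500) = 1;  -- keep every (cnt / 500)th row, about 500 rows in total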
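
The every-nth-row idea can be checked on toy data. The sketch below is not part of the original answer. It uses Python's built-in sqlite3 module with a made-up degree table (5,000 / 3,000 / 2,000 rows for three course codes), substitutes SQLite's random() for SQL Server's newid(), and passes the sample size as a parameter instead of hard-coding 500. The function name nth_row_sample is hypothetical, and window functions are assumed to be available (SQLite 3.25 or later).

```python
import sqlite3
from collections import Counter

# Toy stand-in for the `degree` table (assumption): three courses, 10,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table degree (id integer primary key, coursecode integer)")
conn.executemany(
    "insert into degree (coursecode) values (?)",
    [(code,) for code, n in ((1, 5000), (2, 3000), (3, 2000)) for _ in range(n)],
)


def nth_row_sample(conn, sample_size):
    """Return (id, coursecode) rows: every nth row after sorting by course."""
    query = """
        select id, coursecode
        from (select d.*,
                     -- random() plays the role of newid() in SQL Server
                     row_number() over (order by coursecode, random()) as seqnum,
                     count(*) over () as cnt
              from degree d
             ) d
        where seqnum % (cnt / ?) = 1
    """
    return conn.execute(query, (sample_size,)).fetchall()


sample = nth_row_sample(conn, 1000)
print(len(sample), sorted(Counter(code for _, code in sample).items()))
# prints: 1000 [(1, 500), (2, 300), (3, 200)]
```

On this toy table the 1,000-row sample splits 500 / 300 / 200, matching the 50% / 30% / 20% course mix. Because cnt / sample_size is integer division, the result is only approximately sample_size rows when the population is not an exact multiple of it. When the population is between one and two times the sample size, the step becomes 1 and seqnum % 1 = 1 matches no rows, which fits the remark that the method suits large populations.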

EDIT:

You can also calculate the population size for each group "on the fly":

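select d.*
from (select d.*,
             -- random position of each row within its course
             row_number() over (partition by coursecode order by newid()) as seqnum,
             count(*) over () as cnt,                            -- total population size
             count(*) over (partition by coursecode) as cc_cnt   -- size of this course
      from degree d
     ) d
where seqnum <= 500 * (cc_cnt * 1.0 / cnt);  -- each course's share of a 500-row sample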
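
To reuse the per-course query for different sample sizes, the size can be passed in as a parameter. The following is a minimal sketch under stated assumptions, not the answer's own code. It runs in Python's sqlite3 module against a made-up degree table, uses random() in place of newid(), and rewrites the filter as the integer comparison seqnum * cnt <= sample_size * cc_cnt. That comparison is algebraically the same as seqnum <= sample_size * cc_cnt / cnt but avoids floating-point rounding. The name proportional_sample is hypothetical.

```python
import sqlite3
from collections import Counter

# Toy stand-in for the `degree` table (assumption): three courses, 10,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table degree (id integer primary key, coursecode integer)")
conn.executemany(
    "insert into degree (coursecode) values (?)",
    [(code,) for code, n in ((1, 5000), (2, 3000), (3, 2000)) for _ in range(n)],
)


def proportional_sample(conn, sample_size):
    """Return (id, coursecode) rows, each course sampled in proportion to its size."""
    query = """
        select id, coursecode
        from (select d.*,
                     -- random position of each row within its course
                     row_number() over (partition by coursecode order by random()) as seqnum,
                     count(*) over () as cnt,
                     count(*) over (partition by coursecode) as cc_cnt
              from degree d
             ) d
        -- same as seqnum <= sample_size * cc_cnt / cnt, in integer arithmetic
        where seqnum * cnt <= ? * cc_cnt
    """
    return conn.execute(query, (sample_size,)).fetchall()


for n in (1000, 333):
    per_course = Counter(code for _, code in proportional_sample(conn, n))
    print(n, sorted(per_course.items()))
# prints: 1000 [(1, 500), (2, 300), (3, 200)]
#         333 [(1, 166), (2, 99), (3, 66)]
```

Each course's quota is rounded down, so the total can fall slightly short of the requested size (331 rows for a request of 333 here), and a course whose share is less than one row contributes nothing.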
