从大型数据集中随机抽样 [英] Random sampling from a large dataset
问题描述
有一个大型数据库,我从中提取了研究人群.出于比较的目的,我想选择一个具有相似特征的对照组.我想要匹配的两个标准是年龄和性别.为我提供用于匹配目的的数字的查询是
There's a large database from which I have extracted a study population. For comparison purposes, I would like to select a control group that has similar characteristics. The two criteria on what I would like to match are age and gender. The query to give me the numbers that I want for matching purposes is
select sex, age/10 as decades,COUNT(*) as counts
from
(
select distinct m.patid
,m.sex,DATEPART(year,min(c.admitdate)) -m.yrdob as Age
from members as m
inner join claims as c on c.patid=m.PATID
group by m.PATID, m.sex,m.yrdob
)x group by sex, Age/10
结果集看起来像
这个年龄的decades列由表达式给出
The decades column in this age is given by the expression
(DATEPART(year,min(c.admitdate)) -m.yrdob)/10
这用于使用整数除法查找年龄范围为 20-29、30-39 等的人.例如,我想从更大的数据集中选择 507 名 20 多岁的女性.查找较大数据集特征的查询是
and this is used to find people in the age ranges of 20-29, 30-39 etc using integer division. From a larger dataset I would like to select, for instance, 507 females who are in their 20s. The query to find the characteristics of the larger dataset is
select distinct m.patid
,m.sex
,(DATEPART(year,min(c.admitdate)) -m.yrdob)/10 as decades
from members as m
inner join claims as c on c.patid=m.PATID
group by m.PATID, m.sex,m.yrdob
第二次查询的结果
所以我需要第二个查询中的十年列的 sum
等于第一个查询中的 counts
.我尝试过的(并返回零结果)如下.我需要做些什么来匹配这些年龄?
So I need the sum
of the decades column in the second query to equal counts
in the first query. What I tried (and returned zero results) is below. What do I need to do to match these ages?
运行但不返回结果的查询:
query that runs, but returns no results:
select x.PATID--,x.sex,x.decades,y.counts
from
(
select distinct m.patid
,m.sex
,(DATEPART(year,min(c.admitdate)) -m.yrdob)/10 as decades
from members as m
inner join claims as c on c.patid=m.PATID
group by m.PATID, m.sex,m.yrdob
) as x
inner join
(
select sex, age/10 as decades,COUNT(*) as counts
from
(
select distinct m.patid
,m.sex,DATEPART(year,min(c.admitdate)) -m.yrdob as Age
from members as m
inner join claims as c on c.patid=m.PATID
group by m.PATID, m.sex,m.yrdob
)x group by sex, Age/10
) as y on x.sex=y.sex and x.decades=y.decades
group by y.counts,x.PATID,x.sex,y.sex
having SUM(x.decades)=y.counts and x.sex=y.sex
推荐答案
select
T1.sex,
T1.decades,
T1.counts,
T2.patid
from (
select
sex,
age/10 as decades,
COUNT(*) as counts
from (
select m.patid,
m.sex,
DATEPART(year,min(c.admitdate)) -m.yrdob as Age
from members as m
inner join claims as c on c.patid=m.PATID
group by m.PATID, m.sex,m.yrdob
)x
group by sex, Age/10
) as T1
join (
--right here is where the random sampling occurs
SELECT TOP 50--this is the total number of peolpe in our dataset
patid
,sex
,decades
from (
select m.patid,
m.sex,
(DATEPART(year,min(c.admitdate)) -m.yrdob)/10 as decades
from members as m
inner join claims as c on c.patid=m.PATID
group by m.PATID, m.sex, m.yrdob
) T2
order by NEWID()
) as T2
on T2.sex = T1.sex
and T2.decades = T1.decades
我发布了另一个与此类似的问题,其中我发现我的结果实际上并不是随机的,但它们只是前 N 个结果.我在最外面的查询中通过 newid()
进行了排序,所做的只是围绕完全相同的结果集进行混洗.从现在已关闭的问题中,我发现我需要在上述查询的注释行中使用 TOP
关键字和 order by newid()
.
I had posted another question similar to this in which I found that my results weren't in fact random, but they were only the TOP N results. I had ordered by newid()
in the outermost query and all that was doing was shuffling around the exact same result set. From a question that is now closed, I found out that I needed to use the TOP
keyword along with order by newid()
in the commented line in the above query.
这篇关于从大型数据集中随机抽样的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!