从Oracle进行采样,需要准确的结果数(采样子句) [英] Sampling from Oracle, Need exact number of results (Sample Clause)

查看:135
本文介绍了从Oracle进行采样,需要准确的结果数(采样子句)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从Peoplesoft数据库中随机抽取总体样本.在线搜索使我认为select语句的Sample子句对于我们来说可能是一个可行的选择,但是我很难理解Sample子句如何确定返回的样本数.我查看了在这里找到的oracle文档: http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#i2065953

I am trying to pull a random sample of a population from a Peoplesoft Database. The searches online have lead me to think that the Sample Clause of the select statement may be a viable option for us to use, however I am having trouble understanding how the Sample clause determines the number of samples returned. I have looked at the oracle documentation found here: http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#i2065953

但是上面的参考仅讨论了用于创建示例的语法.我提出这个问题的原因是我需要了解样本百分比如何确定返回的样本量.似乎将随机数应用于您要查询的百分比,然后使用种子数来计数每个"n"条记录.我们的要求是,例如,提取确切数量的样本,它们是随机选择的,并且它们代表整个表(或者至少代表我们使用过滤器选择的数据分组)

But the above reference only talks about the syntax used to create the sample. The reason for my question is I need to understand how the sample percent determines the sample size returned. It seems like it applies a random number to the percent you ask for and then uses a seed number to count every "n" records. Our requirement is that we pull an exact number of samples for example, that they are randomly selected, and that they are representative of the entire table (or at least the grouping of data we choose with filters)

在10200件商品中,如果我需要大约100件商品的样本,则可以使用以下语句:

In a population of 10200 items if I need a sample of approximately 100 items, I could use this statement:

SELECT * FROM PS_LEDGER SAMPLE(1) --1 % of my total population
WHERE DEPTID = '700064' 

但是,我们需要提取确切数量的样本(在这种情况下为100),因此我可以选择一个样本大小,该样本大小几乎总是返回比我需要的数量更多的样本,然后将其缩减为IE

However, We need to pull an exact number of samples (in this case 100) so I could pick a sample size that almost always returns more than the number I need then trim it down IE

SELECT Count(*) FROM PS_LEDGER SAMPLE(2.5) --this percent must always give > 100 items
WHERE DEPTID = '700064' and rownum < 101

我对此的担心是,我的样本不能统一代表整个人口.例如,如果样本函数在创建自己的随机生成的种子后仅拉出每条N条记录,则选择rownum< 101将切断从表底部选择的所有记录.我正在寻找一种从表中准确提取100条记录的方法,这些记录是随机选择的,并且可以代表整个表.请帮忙!

My concern with doing that, is that my sample would not uniformly represent the entire population. For example if the sample function just pulls every N record after it creates its own randomly generated seed, then choosing the rownum < 101 will cut off all of the records chosen from the bottom of the table. What I am looking for is a way to pull out exactly 100 records from the table, which are randomly selected and fairly representative of the entire table. Please help!!

推荐答案

借用jonearles的示例表,我看到的完全一样(在OEL开发人员图像上的11gR2中),通常获取a的值严重偏向1;如果样本量很小,我有时根本看不到.我在评论中提到了额外的随机化/限制步骤:

Borrowing jonearles' example table, I see exactly the same thing (in 11gR2 on an OEL developer image), usually getting values for a heavily skewed towards 1; with small sample sizes I can sometimes see none at all. With the extra randomisation/restriction step I mentioned in a comment:

select a, count(*) from (
    select * from test1 sample (1)
    order by dbms_random.value
)
where rownum < 101
group by a;

...我进行了三轮跑步:

... with three runs I got:

         A   COUNT(*)
---------- ----------
         1         71
         2         29

         A   COUNT(*)
---------- ----------
         1        100

         A   COUNT(*)
---------- ----------
         1         64
         2         36

是的,第二轮比赛中100%真的以1的身份回来了.偏斜本身似乎是相当随机的.我尝试使用了block修饰符,该修饰符似乎没什么区别,也许令人惊讶-我可能以为在这种情况下它会变得更糟.

Yes, 100% really came back as 1 on the second run. The skewing itself seems to be rather random. I tried with the block modifier which seemed to make little difference, perhaps surprisingly - I might have thought it would get worse in this situation.

对于小样本量来说,这可能会变慢,因为它必须击中整个表;但确实确实给了我相当一致的分裂效果:

This is likely to be slower, certainly for small sample sizes, as it has to hit the entire table; but does give me pretty even splits fairly consistently:

select a, count(*) from (
    select a, b from (
        select a, b, row_number() over (order by dbms_random.value) as rn
        from test1
    )
    where rn < 101
)
group by a;

经过三轮测试,我得到了:

With three runs I got:

         A   COUNT(*)
---------- ----------
         1         48
         2         52

         A   COUNT(*)
---------- ----------
         1         57
         2         43

         A   COUNT(*)
---------- ----------
         1         49
         2         51

...看起来更健康.当然是YMMV.

... which looks a bit healthier. YMMV of course.

这篇Oracle文章涵盖了一些采样技术,您可能希望评估ora_hash方法也是如此,如果您的数据传播且您对代表性"有要求,则使用分层版本.

This Oracle article covers some sampling techniques, and you might want to evaluate the ora_hash approach as well, and the stratified version if your data spread and your requirements for 'representativeness' demand it.

这篇关于从Oracle进行采样,需要准确的结果数(采样子句)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆