如何在Redshift中进行分层随机抽样? [英] How to do a stratified random sample in Redshift?

查看:101
本文介绍了如何在Redshift中进行分层随机抽样?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要对从不同类别购买的客户进行随机抽样.有8个类别,订单分布在其中.

I need to take a random sample of customers who have purchased from different categories. There are 8 categories, and orders are spread among them.

如果我想随机抽取已购买商品的客户,但保持每个类别的订单比例不变,我该如何在我的sql代码中进行设置?

If I wanted to take a random sample of customers who have made a purchase, but keep the proportion of orders per category the same, how would I set that up in my sql code?

以下是突出显示此内容的表格(不包含客户数据-我希望我的客户列表基于订单的代表比例):

A table highlighting this is below (it doesn't include customer data - I want my list of customers to be based of the representative proportion of orders):

表可以在这里找到: https://imgur.com/a/Q0lMHWf

推荐答案

基本上,您将使用 row_number() order by ,然后选择第n个值.因此,对于约1%的分层样本,请执行以下操作:

Basically, you would use row_number(), order by, and choose the nth values. So, for about a 1% stratified sample, do:

select t.*
from (select t.*,
             row_number() over (order by category, rand()) as seqnum
      from t
     ) t
where mod(seqnum, 101) = 1
order by category;

基本思想是,您可以通过按类别对结果集进行排序并对结果进行第n个样本来获得分层样本.

The basic idea is that you can get a stratified sample by ordering the result set by the categories and doing an nth sample of the result.

这篇关于如何在Redshift中进行分层随机抽样?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆