Matched-size random samples from a Hive table

Problem description

I have a Hive table activity with columns userid, itemid, and rating. Ratings are either 1 or 0, and there are many more positive ratings (1s) than negative ratings (0s). I need to extract a sample with approximately equal numbers of positive and negative ratings. I need this sample to be as large as possible, so I want to take all the negative rating rows, plus an equal number of positive rating rows, sampled randomly.

For example, let's say we have 100k total rows in the table, 75k with rating=1, and 25k with rating=0. What is the most efficient query (or queries) to return all 25k rows with rating=0 and 25k randomly sampled rows with rating=1? The actual tables are much larger, so speed is important here.

Solution

If you know in advance that negatives are the limiting factor, first run a query to get their exact count (call it N). Then hardcode N into the following query to get the entire sample:

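select * from
(
  -- N random positive rows; replace N with the count of rating=0 rows.
  -- The inner subquery keeps order by / limit scoped to this branch only.
  select * from
  (
    select * from activity where rating=1 order by rand() limit N
  ) positive_sample
  union all
  -- all negative rows
  select * from activity where rating=0
) all_sample
order by rand()  -- optional shuffle; the union already holds exactly 2N rows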

The last order by may not be necessary, depending on your needs.
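
As a rough sketch of the first query mentioned above, plus a variant that avoids sorting the positive rows, something like the following may work. It assumes the activity table from the question. The literals 25000.0 and 75000.0 are the example counts from the question, not real values, and should be replaced with the counts returned by step 1. Because each positive row is kept independently with probability N / M, the positive sample comes out only approximately equal to N, which the question allows.

```sql
-- Step 1: count each class once. N = rows with rating = 0,
-- M = rows with rating = 1.
select rating, count(*) as cnt
from activity
group by rating;

-- Step 2 (approximate variant, an assumption rather than the answer's method):
-- keep every negative row, and keep each positive row with probability N / M.
-- rand() is evaluated per row, so no order by is needed.
-- 25000.0 and 75000.0 are the example counts from the question;
-- substitute the values returned by step 1.
select *
from activity
where rating = 0
   or (rating = 1 and rand() < 25000.0 / 75000.0);
```

The trade-off is exactness for speed: order by rand() limit N returns exactly N positives but has to sort every positive row, while the rand() filter is a single scan whose sample size varies slightly from run to run.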
