Matched size random samples from a Hive table


Question

I have a Hive table activity with columns userid, itemid, and rating. Ratings are either 1 or 0, and there are many more positive ratings (1s) than negative ratings (0s). I need to extract a sample with approximately equal numbers of positive and negative ratings. I need this sample to be as large as possible, so I want to take all the negative rating rows plus an equal number of randomly sampled positive rating rows.

For example, let's say we have 100k total rows in the table, 75k with rating=1 and 25k with rating=0. What is the most efficient query (or queries) to return all 25k rows with rating=0 and 25k randomly sampled rows with rating=1? The actual tables are much larger, so speed is important here.

Recommended answer

If you know in advance that negatives are the limiting factor, you can get their exact count (let's say N) with a first query. Then you can get the entire sample with the following query, hardcoding N as a literal:

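select * from
(
  -- N random positive rows; wrapped in a subquery so order by / limit apply to this branch only
  select * from (
    select * from activity where rating = 1 order by rand() limit N
  ) pos_sample
  union all
  -- all negative rows
  select * from activity where rating = 0
) all_sample
-- optional shuffle; N and 2N are placeholders for literal integers
order by rand() limit 2N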

The final order by may not be necessary, depending on your needs.
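As a rough sketch of the two steps described above: the first statement is one way to obtain N, and the second is a possible single-query variant that avoids hardcoding N by ranking rows in random order within each rating. The variant is not part of the original answer; it assumes a Hive version with windowing functions, and the aliases ranked, neg, rn, and n are made up for illustration. It sorts every row once, so it may or may not be faster than the two-step approach on a large table.

```sql
-- Step 1: count the negative rows; the result is the N used in the answer's query.
select count(*) as n
from activity
where rating = 0;

-- Possible one-query variant (an assumption, not the answer's method):
-- number the rows of each rating in random order, then keep the first n of each.
-- All n negative rows pass the filter, along with n randomly chosen positive rows.
select ranked.userid, ranked.itemid, ranked.rating
from (
  select a.userid, a.itemid, a.rating,
         row_number() over (partition by a.rating order by rand()) as rn
  from activity a
) ranked
cross join (
  select count(*) as n
  from activity
  where rating = 0
) neg
where ranked.rn <= neg.n;
```

With the 100k-row example from the question, the first statement would return 25000, and the second would return 50000 rows, half with rating=0 and half with rating=1.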
