Random sample table with Hive, but including matching rows


Problem description


I have a large table containing a userID column and other user variable columns, and I would like to use Hive to extract a random sample of users based on their userID. Furthermore, sometimes these users will be on multiple rows and if a randomly selected userID is contained in other parts of the table I would like to extract those rows too.

I had a look at the Hive sampling documentation and I see that something like this can be done to extract a 1% sample:

SELECT * FROM source 
TABLESAMPLE (1 PERCENT) s;

but I am not sure how to add the constraint that every other row belonging to those 1% of userIDs is selected too.

Solution

You can use rand() to split the users randomly, with the desired percentage of userIDs in each category. I recommend rand() because setting a seed makes the results repeatable.

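-- Label each distinct userID as 'test' (about 10%) or 'train',
-- then join back to the table to pull every row of the 'test' users.
select c.*
from 
(select userID
, if(rand(5555) < 0.1, 'test', 'train') as type
    from
    (select userID 
    from mytable 
    group by userID
    ) a
) b
right outer join
(select *
from mytable
) c
on b.userID = c.userID
-- this filter drops unmatched rows, so the join behaves like an inner join
where b.type = 'test'
;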

This is set up for entity-level modeling purposes, which is why I have test and train as types. For the 1% sample in the question, change the threshold from 0.1 to 0.01.
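The logic of the query can also be sketched outside Hive. The following is a minimal plain-Python illustration, not Hive code: the function name `sample_users`, the `visit` field, and the toy data are all hypothetical, and Python's `random.Random` stands in for Hive's `rand()`. It shows the same three steps: take the distinct user IDs, make one random draw per user (not per row), then keep every row of the chosen users.

```python
import random


def sample_users(rows, fraction, seed=5555):
    """Return every row belonging to a randomly chosen subset of users."""
    # Step 1: distinct user IDs, sorted so the seeded draws are repeatable.
    user_ids = sorted({row["userID"] for row in rows})
    rng = random.Random(seed)
    # Step 2: one random draw per user, not per row.
    chosen = {uid for uid in user_ids if rng.random() < fraction}
    # Step 3: keep all rows of the chosen users (the join back to the table).
    return [row for row in rows if row["userID"] in chosen]


# Toy table (hypothetical): 1000 users with one to three rows each.
rows = [
    {"userID": uid, "visit": visit}
    for uid in range(1000)
    for visit in range(uid % 3 + 1)
]
sample = sample_users(rows, fraction=0.1)
```

Because the draw happens once per user, a sampled user always contributes all of their rows, which is the property that row-level sampling such as TABLESAMPLE does not guarantee.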
