Random sample table with Hive, but including matching rows
Problem description
I have a large table containing a userID column and other user variable columns, and I would like to use Hive to extract a random sample of users based on their userID. Furthermore, sometimes these users will be on multiple rows, and if a randomly selected userID is contained in other parts of the table I would like to extract those rows too.
I had a look at the Hive sampling documentation and I see that something like this can be done to extract a 1% sample:
SELECT * FROM source
TABLESAMPLE (1 PERCENT) s;
but I am not sure how to add the constraint that all other rows for those 1% of userIDs are selected too.
You can use rand() to split the users randomly, with the desired percentage of userIDs in each category. I recommend rand() because setting the seed makes the results repeatable.
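select c.*
from
  -- b: one row per user, randomly labelled 'test' or 'train'
  (select userID
        -- 0.1 puts roughly 10% of users in 'test'; use 0.01 for a 1% sample
        , if(rand(5555) < 0.1, 'test', 'train') as type
   from
     -- a: distinct userIDs, so each user is labelled exactly once
     (select userID
      from mytable
      group by userID
     ) a
  ) b
right outer join
  -- c: every row of the original table
  (select *
   from mytable
  ) c
on b.userID = c.userID
-- keep all rows that belong to the sampled users
where b.type = 'test'
;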
This is set up for entity level modeling purposes, which is why I have test and train as types.
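As a possible alternative to the join above (not the method from the answer), the sample can be keyed on a hash of userID, so that every row of a selected user is kept without building a separate list of users. The sketch below assumes the table is called mytable, as in the query above, and uses Hive's built-in hash() and pmod() functions. It also assumes the hash of userID is spread evenly enough to stand in for a random draw, which may not hold if the IDs are sequential integers.

```sql
-- Sketch: keep all rows for roughly 1% of users.
-- mytable is the table name assumed from the answer above.
-- pmod(hash(userID), 100) maps each userID to a bucket 0..99;
-- the same userID always lands in the same bucket, so all of
-- that user's rows are selected together.
SELECT t.*
FROM mytable t
WHERE pmod(hash(t.userID), 100) < 1;
```

Changing the comparison to `< 10` would keep about 10% of users instead. Unlike a seeded rand(), the result depends only on the userID values, so the same users are selected on every run.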