带有Hive的随机样本表，但包含匹配的行 [英] Random sample table with Hive, but including matching rows

查看：119 发布时间：2018/6/12 14:02:11 hive hiveql random-sample

本文介绍了带有Hive的随机样本表，但包含匹配的行的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个包含 userID 列和其他用户变量列的大表，我想使用Hive根据用户的<$ code>用户ID 。此外，有时这些用户将在多行上，如果随机选择的 userID 包含在表的其他部分，我也想提取这些行。

我查看了 Hive抽样文档，我发现可以这样做来提取1％的样本：

  SELECT * FROM source 
 TABLESAMPLE（1 PERCENT）s;

但我不确定如何添加约束条件，我希望所有其他1％ userID s也可以。 可以使用rand（）随机分配数据，并在您的类别中使用适当的用户ID百分比。我推荐rand（），因为设置种子可以使结果重复。

  select c。* 
 from 
（从select b $ b选择userID 
，如果（rand（5555）<0.1，'test'，'train'）以类型
结尾
 group by userID 
）a 
）b 
右外连接
（从用户ID 
中选择* 
）c 
 on a.userid = c.userid 
 where type ='test'
;

这是为实体级建模目的设置的，这就是为什么我将测试和训练作为类型。

I have a large table containing a userID column and other user variable columns, and I would like to use Hive to extract a random sample of users based on their userID. Furthermore, sometimes these users will be on multiple rows and if a randomly selected userID is contained in other parts of the table I would like to extract those rows too.

I had a look at the Hive sampling documentation and I see that something like this can be done to extract a 1% sample:
SELECT * FROM source TABLESAMPLE (1 PERCENT) s;
but I am not sure how to add the constraint where I would like all other instances of those 1% userIDs selected too.
解决方案
You can use rand() to split the data randomly and with the proper percent of userid in your category. I recommend rand() because setting the seed to something make the results repeatable.
select c.* from (select userID , if(rand(5555)<0.1, 'test','train') end as type from (select userID from mytable group by userID ) a ) b right outer join (select * from userID ) c on a.userid=c.userid where type='test' ;
This is set up for entity level modeling purposes, which is why I have test and train as types.

这篇关于带有Hive的随机样本表，但包含匹配的行的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

带有Hive的随机样本表，但包含匹配的行 [英] Random sample table with Hive, but including matching rows

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

带有Hive的随机样本表，但包含匹配的行 [英] Random sample table with Hive, but including matching rows

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭