How to sample for each group in Hive?


Question

I have a large table in Hive with more than 1.5 billion rows. One of the columns is category_id, which has about 20 distinct values. I want to sample the table so that I end up with 1 million rows for each category.

I checked out "Random sample table with Hive, but including matching rows" and "Hive: Creating smaller table from big table", and I figured out how to get a random sample from the entire table, but I still can't figure out how to get a sample for each category_id.

Recommended answer

As I understand it, you want to sample your table into multiple files. You might want to look at Hive bucketing or dynamic partitions to balance your records across multiple folders/files.
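
As a rough illustration of the dynamic-partition suggestion, here is a minimal HiveQL sketch. The table names (big_table, big_table_by_category) and the columns id and payload are hypothetical; only category_id comes from the question. The final query uses ROW_NUMBER() over a random ordering to cap each category at 1 million rows. That step is not part of the answer above; it is one common way to take a per-group sample, shown here as an assumption.

```sql
-- Hypothetical names: big_table, big_table_by_category, id, payload.
-- Only category_id is taken from the question.

-- Allow Hive to create partitions from the data itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- One partition (one folder) per category_id value.
CREATE TABLE big_table_by_category (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (category_id INT);

-- In a dynamic-partition insert the partition column comes last in the SELECT list.
INSERT OVERWRITE TABLE big_table_by_category PARTITION (category_id)
SELECT id, payload, category_id
FROM big_table;

-- Assumed per-category sampling step (not from the answer):
-- number the rows of each category in random order and keep the first 1,000,000.
SELECT id, payload, category_id
FROM (
  SELECT id, payload, category_id,
         ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY rand()) AS rn
  FROM big_table_by_category
) ranked
WHERE rn <= 1000000;
```

With about 20 categories this produces about 20 partitions, so a single category can also be read on its own with a filter such as WHERE category_id = 3. Sorting every category by rand() is expensive on a table of this size, so treat the last query as a starting point rather than a tuned solution.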
