How to sample for each group in Hive?
Question
I have a large table in Hive with 1.5 billion+ rows. One of the columns is category_id, which has ~20 distinct values. I want to sample the table so that I get 1 million rows for each category.
I checked out "Random sample table with Hive, but including matching rows" and "Hive: Creating smaller table from big table", and I figured out how to get a random sample from the entire table, but I still can't figure out how to get a sample for each category_id.
Recommended answer
I understand you want to sample your table into multiple files. You might want to look at Hive bucketing or dynamic partitions to balance your records across multiple folders/files.
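For the per-category sample the question asks for, one common pattern is to number the rows of each category_id in random order and keep only the first N. The sketch below is a minimal, hedged illustration of that pattern and not a tested Hive job: it runs the query against an in-memory SQLite database through Python's sqlite3 module so that it can be tried locally. The table name `events` and the column `payload` are hypothetical; only `category_id` comes from the question. In Hive the same query shape should carry over with `rand()` in place of SQLite's `random()` and `rn <= 1000000` as the limit, but that is an assumption to verify on your own cluster.

```python
import sqlite3

# Hypothetical names: table `events` and column `payload` are made up for this
# sketch; only `category_id` appears in the question.
PER_CATEGORY = 3  # stands in for the 1 million rows per category

# Number the rows of each category in random order, then keep the first N.
SAMPLE_SQL = """
SELECT category_id, payload
FROM (
    SELECT category_id,
           payload,
           ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY random()) AS rn
    FROM events
) AS ranked
WHERE rn <= ?
"""


def sample_per_category(conn, per_category):
    """Return up to `per_category` randomly chosen rows for each category_id."""
    return conn.execute(SAMPLE_SQL, (per_category,)).fetchall()


def build_demo_connection():
    """Create a small in-memory table to try the query on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (category_id INTEGER, payload TEXT)")
    rows = [(cat, f"row-{cat}-{i}") for cat in range(4) for i in range(10)]
    # Category 99 has fewer rows than requested, so all of them are returned.
    rows += [(99, "row-99-0"), (99, "row-99-1")]
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    conn = build_demo_connection()
    for category_id, payload in sorted(sample_per_category(conn, PER_CATEGORY)):
        print(category_id, payload)
```

Ranking inside each partition gives an exact cap per category, at the cost of a shuffle and sort over the whole table. A category with fewer rows than the cap simply contributes all of its rows.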