Hive: Creating smaller table from big table


Problem description

I currently have a Hive table that has 1.5 billion rows. I would like to create a smaller table (using the same table schema) with about 1 million rows from the original table. Ideally, the new rows would be randomly sampled from the original table, but getting the top 1M or bottom 1M of the original table would be ok, too. How would I do this?

Solution

As climbage suggested earlier, your best option is probably one of Hive's built-in sampling methods.

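-- The ROWS count takes a plain integer and is applied per input split,
-- so the total number of rows returned can exceed 1,000,000.
INSERT OVERWRITE TABLE my_table_sample
SELECT * FROM my_table
TABLESAMPLE (1000000 ROWS) t;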

This syntax was introduced in Hive 0.11. If you are running an older version of Hive, you'll be confined to using the PERCENT syntax, like so:

INSERT OVERWRITE TABLE my_table_sample 
SELECT * FROM my_table 
TABLESAMPLE (1 PERCENT) t;

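You can change the percentage to match your specific sample size requirements. For about 1 million out of 1.5 billion rows, that works out to roughly 0.067 percent (1,000,000 / 1,500,000,000).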
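The statements above assume that my_table_sample already exists. The following is a minimal sketch of the whole flow, not a definitive recipe. It assumes my_table is unpartitioned (a partitioned target would need a PARTITION clause on the insert) and that the sample should be about 1 million of 1.5 billion rows. The second insert is a different technique from the answer's block sampling: a row-level filter on rand(), added here only as an alternative for cases where a more uniform sample matters.

```sql
-- Create an empty table with the same schema as the source table.
CREATE TABLE IF NOT EXISTS my_table_sample LIKE my_table;

-- Option 1: block sampling by percentage of input size.
-- 1,000,000 / 1,500,000,000 is about 0.067 percent; the result is approximate.
INSERT OVERWRITE TABLE my_table_sample
SELECT * FROM my_table
TABLESAMPLE (0.067 PERCENT) t;

-- Option 2 (alternative technique, run instead of option 1):
-- keep each row with probability ~0.00067. This scans the full table,
-- and the number of rows returned varies from run to run.
INSERT OVERWRITE TABLE my_table_sample
SELECT * FROM my_table
WHERE rand() < 0.00067;
```

Block sampling reads only part of the input, so it is usually much cheaper on a table this size, but whole blocks are taken together and the rows are not independently sampled. The rand() filter trades a full scan for a sample that is closer to uniformly random.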
