Sorted Table in Hive (ORC file format)


Question

I'm having some difficulty making sure I'm leveraging sorted data within a Hive table (using the ORC file format).

I understand we can affect how the data is read from a Hive table by declaring CLUSTERED BY / SORTED BY clauses in the create DDL.

CREATE TABLE trades
(
    trade_id INT,
    name STRING,
    contract_type STRING,
    ts INT
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (trade_id) SORTED BY (trade_id, ts) INTO 8 BUCKETS
STORED AS ORC;

This means that every time I query this table, the data will be distributed by trade_id among the various mappers and then sorted.

My problem is:

I do not want the data split into N files (buckets), because the volume is not that large and I would end up with small files.

However, I do want to leverage sorted insertion.

INSERT OVERWRITE TABLE trades
PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY trade_id
SORT BY trade_id;

Do I really need to use CLUSTERED/SORTED BY in the create DDL statement? Or does Hive/ORC know how to leverage the fact that the insertion process already ensured the data is sorted?

Would it make sense to do something like:

CLUSTERED BY (trade_id) SORTED BY (trade_id, ts) INTO 1 BUCKETS

Answer

Bucketed tables are an outdated concept.

You do not need to write CLUSTERED BY in the table DDL.
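As a sketch (keeping the question's columns), the table DDL can drop the bucketing clauses entirely; ordering is then enforced at insert time with DISTRIBUTE BY + SORT BY instead:

```sql
-- Same trades table, but without CLUSTERED BY / SORTED BY.
CREATE TABLE trades
(
    trade_id INT,
    name STRING,
    contract_type STRING,
    ts INT
)
PARTITIONED BY (dt STRING)
STORED AS ORC;
```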

When loading the table, use DISTRIBUTE BY the partition key to reduce pressure on the reducers. This matters especially when writing ORC, which needs intermediate buffers to build the files; if each reducer writes many partitions, it may cause an OOM exception.

When the table is big, you can limit the maximum file size using bytes.per.reducer, like this:

set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB, or even less

If you have more data, more reducers will be started and more files will be created. This is more flexible than loading a fixed number of buckets.

It also works better for small tables, because you are not forced to create many small buckets.

ORC has internal indexes and bloom filters. Applying SORT improves the efficiency of the indexes and bloom filters, because similar data is stored together. It can also improve compression, depending on your data's entropy.
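For instance (the column choice here is only an illustration), bloom filters on specific columns are enabled through ORC table properties in the DDL:

```sql
-- Illustrative only: enable ORC bloom filters for trade_id and contract_type.
CREATE TABLE trades
(
    trade_id INT,
    name STRING,
    contract_type STRING,
    ts INT
)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES (
    "orc.bloom.filter.columns" = "trade_id,contract_type",
    "orc.bloom.filter.fpp"     = "0.05" -- acceptable false-positive probability
);
```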

If distributing by the partition key is not enough because you have some data skew and the data is big, you can additionally distribute by a random value. It is better to distribute by a column if the data is evenly distributed; if not, distribute randomly to avoid the problem of a single long-running reducer.

Finally, your insert statement may look like this:

set hive.exec.reducers.bytes.per.reducer=33554432; -- 32 MB per reducer

INSERT OVERWRITE TABLE trades PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY dt,                    -- partition key is a must for big data
              trade_id,              -- some other key, if the data is too big and
                                     -- the key is evenly distributed (no skew)
              FLOOR(RAND()*100.0)%20 -- random: additionally split into 20 roughly equal parts
SORT BY contract_type;               -- sort the data if you want filtering on this key
                                     -- to work better using the internal index

Do not use CLUSTERED BY in the table DDL: with DISTRIBUTE BY during the insert, plus ORC with indexes and bloom filters, plus SORT, you can achieve the same thing in a more flexible way.

Distribute + sort can reduce the size of ORC files dramatically, by 3x or 4x. Similar data compresses better and makes the internal indexes more efficient.
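To illustrate the payoff (a hypothetical query; table and values are just examples), once the data is sorted by contract_type, a filter on that column lets ORC skip stripes and row groups whose min/max statistics exclude the value:

```sql
-- Hypothetical query: with predicate pushdown enabled, ORC can skip
-- stripes/row groups whose min/max statistics do not contain 'FUTURE'.
set hive.optimize.ppd=true;

SELECT trade_id, ts
FROM trades
WHERE dt = '2020-01-01'         -- partition pruning
  AND contract_type = 'FUTURE'; -- served by the ORC min/max index (and bloom filter, if enabled)
```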

Also read this: https://stackoverflow.com/a/55375261/2700344 and this related answer about sorting: https://stackoverflow.com/a/47416027/2700344
