Sorted Table in Hive (ORC file format)


Question

I'm having some difficulty making sure I'm leveraging sorted data within a Hive table (using the ORC file format).

I understand we can affect how the data is read from a Hive table by declaring a DISTRIBUTE BY clause in the create DDL.

CREATE TABLE trades
(
    trade_id INT,
    name STRING,
    contract_type STRING,
    ts INT
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (trade_id) SORTED BY (trade_id, ts) INTO 8 BUCKETS
STORED AS ORC;

This means that every time I query this table, the data will be distributed by trade_id among the various mappers and afterwards sorted.

My issue is:

I do not want the data to be split into N files (buckets), because the volume is not that large and I would end up with small files.

However, I do want to leverage sorted insertion.

INSERT OVERWRITE TABLE trades
PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY trade_id
SORT BY trade_id;

Do I really need to use CLUSTERED/SORTED BY in the create DDL statement? Or does Hive/ORC know how to leverage the fact that the insertion process already ensured that the data is sorted?

Would it make sense to have something like:

CLUSTERED BY (trade_id) SORTED BY (trade_id, ts) INTO 1 BUCKETS

Answer

Bucketed tables are an outdated concept.

You do not need to write CLUSTERED BY in table DDL.
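
As a minimal sketch (same columns as the question's table, just without the bucketing clause), the DDL can simply be:

CREATE TABLE trades
(
    trade_id INT,
    name STRING,
    contract_type STRING,
    ts INT
)
PARTITIONED BY (dt STRING)
STORED AS ORC;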

When loading the table, use DISTRIBUTE BY on the partition key to reduce pressure on the reducers. This matters especially when writing ORC, which requires intermediate buffers while building the files; if each reducer writes to many partitions at once, it may cause an OOM exception.
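
A minimal sketch of such a load (reusing the trades/raw_trades tables from the question; the dynamic-partition settings are standard Hive properties): distributing by dt sends all rows of a given partition to the same reducer, so each reducer buffers ORC data for fewer partitions at a time.

-- Allow dynamic partitioning for PARTITION (dt)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE trades PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY dt; -- all rows of a partition go to the same reducer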

When the table is big, you can limit the max file size using bytes.per.reducer like this:

set hive.exec.reducers.bytes.per.reducer=67108864; -- 64MB per reducer, or even less

If you have more data, more reducers will be started and more files created. This is more flexible than loading a fixed number of buckets.

It also works better for small tables, because you are not forced to split them into even smaller bucket files.

ORC has internal indexes and bloom filters. Applying SORT improves the efficiency of the indexes and bloom filters, because similar data is stored together. It can also improve compression, depending on your data's entropy.
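
As a sketch, bloom filters are configured through standard ORC table properties; the column choice below is an assumption based on the question's schema, and the properties only affect files written after the change:

ALTER TABLE trades SET TBLPROPERTIES (
    'orc.bloom.filter.columns' = 'trade_id,contract_type', -- columns to build bloom filters for
    'orc.bloom.filter.fpp' = '0.05'                        -- target false-positive probability
);

For reads to benefit, predicate pushdown should be enabled (hive.optimize.index.filter=true), so Hive can skip row groups using the min/max indexes and bloom filters.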

If distributing by the partition key is not enough because the data is big and skewed, you can additionally distribute by a random expression. Prefer distributing by a column if the data is evenly distributed; otherwise, add a random component to avoid a single long-running reducer.

Finally, your insert statement may look like this:

set hive.exec.reducers.bytes.per.reducer=33554432; -- 32MB per reducer

INSERT OVERWRITE TABLE trades PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY dt,                    -- partition key is a must for big data
              trade_id,              -- some other key if the data is too big and
                                     -- the key is evenly distributed (no skew)
              FLOOR(RAND()*100.0)%20 -- random component to additionally split
                                     -- into 20 roughly equal parts
SORT BY contract_type;               -- sort if you want filtering on this key to
                                     -- work better using the internal index

Do not use CLUSTERED BY in the table DDL: with DISTRIBUTE BY plus SORT during insert, together with ORC's indexes and bloom filters, you can achieve the same thing in a more flexible way.

Distribute + sort can reduce the size of ORC files dramatically, often by a factor of 3x or 4x: similar data compresses better and makes the internal indexes more efficient.
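
To check the effect, you can compare a partition's on-disk size before and after reloading (a sketch; the warehouse path below is hypothetical and depends on your configuration):

-- From the Hive CLI: list file sizes for one partition on HDFS.
-- Adjust the path to your actual warehouse location.
dfs -du -h /user/hive/warehouse/trades/dt=2020-01-01;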

Also read this: https://stackoverflow.com/a/55375261/2700344 and this related answer about sorting: https://stackoverflow.com/a/47416027/2700344
