Compaction in Impala Tables

Question

I want to learn about compaction in Impala tables but can't find any material to study. What are the different techniques, and where can I find material on them?

Recommended answer

The main purpose of compaction is to avoid the small files problem, and the right technique depends on your use case.

For example, you could have a process that writes small files into HDFS, and you want to query those files as an Impala table. You could keep a staging table for these small files and load the base table using INSERT INTO TABLE base_table SELECT ..... FROM stg_table to compact the tiny files into bigger files.
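
A minimal sketch of that staging pattern in Impala SQL. Only base_table and stg_table come from the answer above; the column list (id, payload) and the Parquet file format are assumptions for illustration.

```sql
-- Make Impala aware of files that an external process added to the staging table.
REFRESH stg_table;

-- Hypothetical schema; adapt the columns and file format to your data.
CREATE TABLE IF NOT EXISTS base_table (
  id      BIGINT,
  payload STRING
)
STORED AS PARQUET;

-- Rewrite the many small staging files into base_table.
INSERT INTO TABLE base_table
SELECT id, payload
FROM stg_table;

-- Refresh optimizer statistics after the load.
COMPUTE STATS base_table;
```

How many files the INSERT writes depends on the data volume and on how many nodes take part in the write, so it is worth checking the result rather than assuming a single large file.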

Another use case is partitioning. A major risk when using partitioning is creating partitions that lead you into the small files problem. When this happens, partitioning a table actually worsens query performance (the opposite of the goal of partitioning) because it causes too many small files to be created. This is more likely with dynamic partitioning, but it can still happen with static partitioning: for example, if you add a new partition to a sales table every day containing the previous day's sales, and each day's data is not particularly big.
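
One way to check whether a partitioned table has drifted into the small files problem is to look at file counts and sizes per partition. A sketch, assuming a hypothetical table named sales partitioned by a hypothetical STRING column sale_date:

```sql
-- One row per partition, including the number of files and the total size.
SHOW TABLE STATS sales;

-- List the individual data files and their sizes for a single partition.
-- The partition value here is only an example.
SHOW FILES IN sales PARTITION (sale_date = '2020-01-01');
```

Partitions that hold many files of only a few megabytes or less are the usual candidates for compaction or for a coarser partitioning scheme.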

When choosing your partitions, you want to strike a balance between too many partitions (causing the small files problem) and too few partitions (providing little performance benefit). The partition column or columns should have a reasonable number of distinct values, but what counts as reasonable is difficult to quantify.
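
Before committing to a partition column, it can help to estimate how many partitions each candidate would produce. A sketch using Impala's approximate distinct-count function NDV(); the table events_raw and its TIMESTAMP column event_time are hypothetical names.

```sql
-- Approximate number of partitions each candidate key would create.
SELECT
  NDV(event_time)          AS approx_distinct_timestamps,
  NDV(to_date(event_time)) AS approx_distinct_days,
  NDV(year(event_time))    AS approx_distinct_years
FROM events_raw;
```

The counts are estimates, which is normally enough to tell a key with a few hundred values from one with millions.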

Using dynamic partitioning is particularly dangerous because if you're not careful, it's easy to partition on a column with too many distinct values. Imagine a use case where you often look for data that falls within a time frame specified in your query. You might think it's a good idea to partition on a time-related column. But a TIMESTAMP column can store time to the nanosecond, so every row could have a unique value; that would be a terrible choice for a partition column! Even partitioning to the minute or hour could create far too many partitions, depending on the nature of your data; partitioning by larger time units like day, month, or even year might be a better choice.
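
As an illustration of partitioning by a coarser time unit, the sketch below derives year and month partition keys from a TIMESTAMP column instead of partitioning on the timestamp itself. All table and column names (events_raw, events_by_month, event_id, event_time, event_year, event_month) are hypothetical.

```sql
-- Partition on derived year and month values rather than on the raw timestamp.
CREATE TABLE events_by_month (
  event_id   BIGINT,
  event_time TIMESTAMP
)
PARTITIONED BY (event_year INT, event_month INT)
STORED AS PARQUET;

-- Dynamic partition insert: the trailing SELECT columns supply the partition keys.
INSERT INTO TABLE events_by_month PARTITION (event_year, event_month)
SELECT event_id, event_time, year(event_time), month(event_time)
FROM events_raw;

-- Queries that filter on the partition keys only read the matching partitions.
SELECT COUNT(*)
FROM events_by_month
WHERE event_year = 2020 AND event_month = 3;
```

Whether month is the right granularity depends on how much data arrives per period; the point is only that the partition key has far fewer distinct values than the timestamp.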

The above is only an introduction to the problem; there are many more use cases, and the general topic is performance and tuning.

You could start with the Cloudera documentation. You could follow this link:

Tuning Impala for Performance

Hope this helps.
