When the underlying data changes, do we need to drop and create the partition in Hive?


Question

Let's say I have a Hive table partitioned by date with its data stored in S3 as Parquet files. Let's also assume that for a particular partition (date), there were originally 20 records.
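For concreteness, a table like the one described might be defined roughly as follows; the table name, columns, and S3 path here are hypothetical, not taken from the question:

-- Hypothetical external table: partitioned by date, Parquet files stored in S3
CREATE EXTERNAL TABLE events (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/warehouse/events/';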

If I then delete the original files and put new Parquet files with 50 records in the same folder, do I need to drop and recreate that partition for the new data to be reflected?

My understanding was that we don't have to recreate partitions. So I tried removing the old data from the folder and putting the new data in its place, without "updating" the Hive partition. However, when I then ran count(*) for that date, it still showed 20 records instead of 50. After dropping and creating the partition again, it showed the correct count. Is that the expected behavior?
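For reference, "dropping and creating the partition again" would look roughly like this, assuming the hypothetical events table above and the default dt=... folder layout:

ALTER TABLE events DROP PARTITION (dt='2019-01-01');
-- Re-register the partition; its folder already contains the new 50-record file
ALTER TABLE events ADD PARTITION (dt='2019-01-01');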

Answer

Hive optimizes simple queries like select count(*) using statistics. If this property is set:

set hive.compute.query.using.stats=true;

Then Hive will take the count from the statistics stored in the metastore instead of scanning the files.

When you replace the files with new ones, the statistics remain unchanged. When you drop the partition, all of its statistics are deleted as well, which is why you got the correct count after re-creating the partition.
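You can inspect the stored statistics yourself; with the hypothetical table above, something like this shows the values that count(*) is answered from:

DESCRIBE FORMATTED events PARTITION (dt='2019-01-01');
-- In the output, "Partition Parameters" lists numFiles, numRows,
-- rawDataSize and totalSize; these stay at the old values (20 rows)
-- after the files are swapped outside of Hive.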

See also this answer: Hive select count(*) non null returns higher value than select count(*). In that case a predicate prevents the use of statistics.

This behavior is quite expected. You can either run

set hive.compute.query.using.stats=false;

to switch off the use of statistics when computing query results. Your re-creation of the partition effectively did the same thing, because it removed the statistics; that is why the statistics were not used, the files were scanned, and the count came back correct.

Or you can analyze the table to refresh the statistics and keep the above parameter set to true, so the next time you run a simple aggregation it will still be fast:

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]  
  COMPUTE STATISTICS
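Applied to the hypothetical table and partition above, that would be, for example:

ANALYZE TABLE events PARTITION (dt='2019-01-01') COMPUTE STATISTICS;
-- Optionally refresh column-level statistics as well:
ANALYZE TABLE events PARTITION (dt='2019-01-01') COMPUTE STATISTICS FOR COLUMNS;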

For a small file with only 50 records the performance difference is not significant, but it is still better to keep the statistics updated, since the optimizer also uses them to build an optimal query plan.

More details here: analyze table

And if you insert data using INSERT OVERWRITE, you can enable automatic statistics gathering:

set hive.stats.autogather=true;
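With auto-gathering enabled, rewriting the partition through Hive instead of swapping files in S3 keeps the statistics in sync. A sketch, assuming a hypothetical staging table events_staging with the same columns:

INSERT OVERWRITE TABLE events PARTITION (dt='2019-01-01')
SELECT id, payload
FROM events_staging;
-- Basic statistics (numRows, totalSize, ...) for the partition are
-- updated as part of this write, so stats-based count(*) stays correct.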

