Hive query performance for high cardinality field
Question
I have a single but huge table in Hive which will almost always be queried by its primary key column (say, employee_id). The table will be really huge, with millions of rows inserted each day, and I want to query it fast using partitions over this field. I followed this post and I know that partitioning is only good for low cardinality fields, so how can I accomplish my goal of querying fast by the employee_id column?
I understand that an id column with very high cardinality should be used for bucketing, but that does not help me with query performance over a single table, does it?
I think that if I could use something like hash(employee_id) as partitions, it would help me very much. Is this possible? I couldn't find such a thing in the Hive documentation.
To summarize, what I want is a fast query result for:
select * from employee where employee_id=XXX
assuming the employee table has billions of records, with primary key column employee_id, where classical partitioning by year, month, day etc. does not help.
Thanks in advance,
Recommended answer
- Use ORC with bloom filters:
CREATE TABLE employee (
employee_id bigint,
name STRING
) STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns"="employee_id")
;
- Enable PPD (predicate pushdown) with vectorization, and use CBO and Tez:
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled = true;
SET hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.execution.engine=tez;
set hive.stats.fetch.column.stats=true;
set hive.map.aggr=true;
SET hive.tez.auto.reducer.parallelism=true;
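Since the question asks about bucketing, here is a minimal sketch of how the bloom filter table from the first point could also be bucketed and sorted by employee_id. The table name employee_bucketed, the bucket count of 64, and the literal 12345 are assumptions for illustration and do not come from the original answer. Sorting within buckets keeps the min/max range of employee_id in each ORC stripe narrow, so predicate pushdown can skip most stripes, and the bloom filter handles the rest. hive.optimize.index.filter and hive.tez.bucket.pruning are not among the settings listed above; check that your Hive version supports them before relying on them.

-- Hypothetical table name and bucket count; tune the bucket count to your data volume.
CREATE TABLE employee_bucketed (
employee_id bigint,
name STRING
)
CLUSTERED BY (employee_id) SORTED BY (employee_id) INTO 64 BUCKETS
STORED AS ORC
TBLPROPERTIES (
"orc.bloom.filter.columns"="employee_id",
"orc.bloom.filter.fpp"="0.05" --false-positive probability; 0.05 is the ORC default
)
;
--On Hive 1.x, also set hive.enforce.bucketing=true and hive.enforce.sorting=true before loading.
INSERT OVERWRITE TABLE employee_bucketed
SELECT employee_id, name FROM employee;
--Assumed extra settings: push the filter into the ORC reader and prune buckets on Tez.
SET hive.optimize.index.filter=true;
SET hive.tez.bucket.pruning=true;
--Inspect the plan; the predicate should appear on the TableScan operator (typically as filterExpr).
EXPLAIN
SELECT * FROM employee_bucketed WHERE employee_id = 12345;

Keep the literal the same type as the column (a bigint here) so the predicate is not wrapped in a cast, which can prevent pushdown.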
Tune the parallelism of mappers and reducers:
--example settings for mappers:
set tez.grouping.max-size=67108864;
set tez.grouping.min-size=32000000;
--example settings for reducers:
set hive.exec.reducers.bytes.per.reducer=67108864; --decrease this to increase the number of reducers
Change these figures to achieve optimal performance.
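To make the effect of hive.exec.reducers.bytes.per.reducer concrete, here is a rough worked estimate. The 10 GiB input size is an assumed figure, not a measurement; the real reducer count is also capped by hive.exec.reducers.max and may be adjusted at runtime when hive.tez.auto.reducer.parallelism is enabled.

--Rough estimate: reducers ~= bytes fed to the reduce stage / bytes per reducer
--assumed input: 10 GiB = 10737418240 bytes
--67108864 bytes (64 MiB) per reducer: 10737418240 / 67108864 = 160 reducers
--33554432 bytes (32 MiB) per reducer: 10737418240 / 33554432 = 320 reducers
set hive.exec.reducers.bytes.per.reducer=33554432; --halving the value doubles the estimate

A plain filter query such as the lookup in the question has no reduce stage, so the reducer setting matters only for queries that join or aggregate; for the lookup itself, the mapper grouping sizes are the relevant knob.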