Hive query performance for high cardinality field


Question

I have a single, very large table in Hive that will almost always be queried by its primary key column (say, employee_id). Millions of rows will be inserted each day, and I want fast queries by partitioning on this field. I followed this post and I know that partitioning is only good for low-cardinality fields, so how can I get fast queries on the employee_id column?

I understand that an id column with very high cardinality should be used for bucketing, but that does not help query performance on a single table, does it?

I think that using something like hash(employee_id) as the partition key would help me a lot. Is this possible? I couldn't find anything like it in the Hive documentation.

To summarize, what I want is a fast result for:

    select * from employee where employee_id=XXX

assuming the employee table has billions of records and a primary key column employee_id, so that classical partitioning by year, month, day, etc. does not help.

Thanks in advance,

Recommended answer

  1. Use ORC with bloom filters:

    CREATE TABLE employee (
      employee_id bigint,
      name STRING
    ) STORED AS ORC 
    TBLPROPERTIES ("orc.bloom.filter.columns"="employee_id")
    ;
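
Bloom filters and ORC min/max indexes can skip more data when rows with nearby employee_id values are stored together, and they are only consulted when the reader is allowed to apply the filter inside the ORC file. The following is a minimal sketch under those assumptions. The table employee_staging is hypothetical, and hive.optimize.index.filter is a setting that is not among those listed in this answer, so check it against your Hive version before relying on it.

```sql
-- Assumption: employee_staging is a hypothetical table holding the day's raw rows.
-- DISTRIBUTE BY sends all rows with the same id to the same writer, and
-- SORT BY makes each writer emit ids in order, so each ORC stripe and
-- row group covers a narrow range of employee_id values.
INSERT INTO TABLE employee
SELECT employee_id, name
FROM employee_staging
DISTRIBUTE BY employee_id
SORT BY employee_id;

-- Assumption: allows the ORC reader to use its indexes for the filter.
SET hive.optimize.index.filter=true;

-- 12345 is an arbitrary example id.
SELECT * FROM employee WHERE employee_id = 12345;
```

Whether the sort is worth its cost depends on how the daily load is written, so treat it as something to measure and not as a given.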

  2. Enable PPD with vectorization, and use CBO and Tez:

    -- Predicate pushdown, including pushdown into storage handlers
    SET hive.optimize.ppd=true;
    SET hive.optimize.ppd.storage=true;
    -- Vectorized execution on the map and reduce sides
    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled=true;
    -- Cost-based optimizer and the statistics it relies on
    SET hive.cbo.enable=true;
    SET hive.stats.autogather=true;
    SET hive.compute.query.using.stats=true;
    SET hive.stats.fetch.partition.stats=true;
    SET hive.stats.fetch.column.stats=true;
    -- Run on Tez
    SET hive.execution.engine=tez;
    SET hive.tez.auto.reducer.parallelism=true;
    -- Map-side partial aggregation
    SET hive.map.aggr=true;

Reference: https://community.cloudera.com/t5/Community-Articles/Optimizing-Hive-queries-for-ORC-formatted-tables/ta-p/248164
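
Settings such as hive.cbo.enable and hive.stats.fetch.column.stats rely on statistics being present in the metastore. A short sketch, assuming the employee table from step 1, of collecting them explicitly and then checking the plan:

```sql
-- Table-level statistics (row count, data size).
ANALYZE TABLE employee COMPUTE STATISTICS;

-- Column-level statistics, read when hive.stats.fetch.column.stats=true.
ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS employee_id, name;

-- Inspect the plan; 12345 is an arbitrary example id.
EXPLAIN SELECT * FROM employee WHERE employee_id = 12345;
```

If the predicate is being pushed down, the table scan in the EXPLAIN output would be expected to show the employee_id condition as a filter expression. The exact layout of the plan varies between Hive versions.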

  3. Tune the parallelism of mappers and reducers:

--example settings for mappers:

 set tez.grouping.max-size=67108864;
 set tez.grouping.min-size=32000000;

--example settings for reducers:

 set hive.exec.reducers.bytes.per.reducer=67108864; --decrease this to increase the number of reducers

Change these figures to achieve optimal performance.
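
As one illustration of changing these figures, the sketch below halves the grouping sizes and the bytes per reducer, which should lead Tez to create roughly twice as many tasks over the same input. The values are arbitrary examples, not recommendations.

```sql
-- Smaller split groups -> more mappers.
-- 32 MiB, half of the 64 MiB used above.
SET tez.grouping.max-size=33554432;
-- Half of the 32000000 used above.
SET tez.grouping.min-size=16000000;

-- Fewer bytes per reducer -> more reducers (32 MiB here).
SET hive.exec.reducers.bytes.per.reducer=33554432;
```

A point lookup like the SELECT in the question typically runs without a reduce stage, so the reducer setting matters mainly for queries that aggregate or join.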
