Is there a way to identify or detect data skew in a Hive table?


Question

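We have many Hive queries that take a long time to run. We are already using Tez and other good practices such as CBO and ORC files.

Is there a way to check or analyze data skew, for example with some command? Would an explain plan help, and if so, which parameter should I look for?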

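Answer

An explain plan will not help with this; you need to check the data itself. If the query is a join, select the top 100 join key values from every table involved in the join. If it is an analytic function, do the same for the PARTITION BY key. The row counts will show whether the data is skewed.

Example: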

select key, count(*) cnt
  from your_table  -- placeholder name; use each table involved in the join
 group by key
having count(*) > 1000 -- also check > 1 for tables that should have no duplicates (like dimensions)
 order by cnt desc
 limit 100;

key can be a composite join key (all the columns used in the join ON condition).

Also have a look at this answer: https://stackoverflow.com/a/51061613/2700344
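As a rough sketch of the composite-key case, the same check can group by every column in the ON condition. The table name `orders` and the columns `customer_id` and `region_id` are hypothetical, chosen only for illustration; substitute the real tables and join columns, and run the query once per table in the join.

```sql
-- Hypothetical table and columns, for illustration only.
-- Assumes the join condition is on (customer_id, region_id).
select customer_id,
       region_id,
       count(*) as cnt
  from orders
 group by customer_id, region_id
 order by cnt desc
 limit 100;
```

Rows with a NULL key are grouped together here, so a large NULL group will also show up near the top of the result.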
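To judge how dominant the top keys are, it may help to show each key's share of the total row count next to the raw count. This is only a sketch: `your_table` and `key` are placeholders, and it assumes a Hive version with windowing functions available.

```sql
-- Placeholder names: replace your_table and key with the real table and join key.
-- share_of_rows is the fraction of all rows that carry this key value.
select t.key,
       t.cnt,
       t.cnt / sum(t.cnt) over () as share_of_rows
  from (select key, count(*) as cnt
          from your_table
         group by key) t
 order by t.cnt desc
 limit 20;
```

A handful of keys holding a large fraction of the rows suggests that the tasks processing those keys will run much longer than the rest.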

