Tuning Hive Queries That Use an Underlying HBase Table


Question


I've got a table in HBase, let's say "tbl", and I would like to query it using Hive. Therefore I mapped the table to Hive as follows:

CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
TBLPROPERTIES("hbase.table.name" = "tbl");

Queries like:

select * from tbl
select id from tbl
select id, data from tbl

are really fast.

But queries like

select id from tbl where substr(id, 0, 5) = "12345"

select id from tbl where data["777"] IS NOT NULL 

are incredibly slow.

On the contrary, when running from the HBase shell:

scan 'tbl', { COLUMNS=>'data', STARTROW='12345', ENDROW='12346' }

scan 'tbl', { COLUMNS=>'data', FILTER => FilterList.new([qualifierFilter('777')]) }

it is lightning fast!

When I looked into the mapred job generated by Hive on the jobtracker, I discovered that "map.input.records" counts ALL the items in the HBase table, meaning the job makes a full table scan before it even starts any mappers! Moreover, I suspect it copies all the data from the HBase table to the mappers' tmp input folder on HDFS before execution.
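The gap the asker observes can be modeled in a few lines: HBase keeps row keys sorted, so a scan bounded by STARTROW/ENDROW only reads the matching slice, while a query whose predicate is not pushed down must examine every row. A minimal sketch of that difference, using a toy in-memory model rather than the real HBase API:

```python
import bisect

# Toy model of an HBase table: row keys are kept in sorted order.
rows = sorted(f"{i:05d}" for i in range(100000))

def full_scan_filter(rows, prefix):
    """What an un-pushed predicate does: examine every row, filter late."""
    return [r for r in rows if r.startswith(prefix)]

def range_scan(rows, start, stop):
    """What STARTROW/ENDROW does: locate the boundaries in the sorted
    keys and return only the matching slice."""
    lo = bisect.bisect_left(rows, start)
    hi = bisect.bisect_left(rows, stop)
    return rows[lo:hi]

# Same result, but range_scan touches 1 row instead of 100000.
assert full_scan_filter(rows, "12345") == range_scan(rows, "12345", "12346")
```

This is why the shell's bounded scan is "lightning fast" while the Hive query that filters after the fact is not.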

So, my questions are: why doesn't the HBase storage handler for Hive translate Hive queries into the appropriate HBase functions? Why does it scan all the records and only then slice them using the "where" clause? How can this be improved?

Any suggestions to improve the performance of Hive queries (mapped to an HBase table)?

Can we create secondary indexes on HBase tables?

We are using HBase and Hive integration and trying to tune the performance of Hive queries.

Solution

Lots of questions! I'll try to answer them all and give you a few performance tips:

The data is not copied to the HDFS, but the mapreduce jobs generated by HIVE will store their intermediate data in the HDFS.

Secondary indexes or alternative query paths are not supported by HBase (more info).

Hive will translate everything into MapReduce jobs, which need time to be distributed and initialized. If you have a very small number of rows, it's possible that a simple SCAN operation in the HBase shell is faster than a Hive query, but on big datasets, distributing the job among the datanodes is a must.

The Hive HBase handler doesn't do a very good job of extracting the start & stop row keys from the query; queries like substr(id, 0, 5) = "12345" won't use start & stop row keys.

Before executing your queries, run an EXPLAIN [your_query]; command and check for the filterExpr: part; if you don't find it, your query will perform a full table scan. On a side note, all expressions within the Filter Operator: will be transformed into the appropriate filters.

EXPLAIN SELECT * FROM tbl WHERE (id>='12345') AND (id<'12346')
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        tbl 
          TableScan
            alias: tbl 
            filterExpr:
                expr: ((id>= '12345') and (id < '12346'))
                type: boolean
            Filter Operator
                ....
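The check described above can be scripted. A small hedged helper (hypothetical, not part of Hive) that pulls the pushed-down expression out of EXPLAIN's text output, returning None when the plan implies a full table scan:

```python
import re

def pushed_filter_expr(explain_output: str):
    """Return the expression under 'filterExpr:' if present, else None.

    A None result means the query will perform a full table scan.
    (Hypothetical helper parsing Hive's plain-text EXPLAIN output.)
    """
    m = re.search(r"filterExpr:\s*\n\s*expr:\s*(.+)", explain_output)
    return m.group(1).strip() if m else None

plan = """\
          TableScan
            alias: tbl
            filterExpr:
                expr: ((id >= '12345') and (id < '12346'))
                type: boolean
"""
print(pushed_filter_expr(plan))
# → ((id >= '12345') and (id < '12346'))
```

A plan without the filterExpr: line would return None, flagging the query for a rewrite before you run it.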

Fortunately, there is an easy way to make sure start & stop row keys are used when you're looking for row-key prefixes: just convert substr(id, 0, 5) = "12345" into a simpler query, id >= "12345" AND id < "12346". It will be detected by the handler, and start & stop row keys will be provided to the SCAN (12345, 12346).
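That rewrite can be generated mechanically: the stop key is the prefix with its last character incremented. A hedged sketch of such a helper (a hypothetical utility, not part of Hive; it ignores the carry case where the prefix ends in the maximum character):

```python
def prefix_stop_key(prefix: str) -> str:
    """Exclusive stop key for a row-key prefix scan: the prefix with
    its last character incremented (carry case not handled)."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

def prefix_to_range_predicate(column: str, prefix: str) -> str:
    """Rewrite a substr()-style prefix match into the range form the
    HBase storage handler can push down as scan start/stop keys."""
    return f"{column} >= '{prefix}' AND {column} < '{prefix_stop_key(prefix)}'"

print(prefix_to_range_predicate("id", "12345"))
# → id >= '12345' AND id < '12346'
```

Pasting the generated predicate into the WHERE clause gives the handler exactly the bounds it can turn into STARTROW/ENDROW.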


Now, here are a few tips in order to speed up your queries (by a lot):

  • Make sure you set the following properties to take advantage of batching and reduce the number of RPC calls (the right number depends on the size of your columns):

    SET hbase.scan.cache=10000;

    SET hbase.client.scanner.cache=10000;

  • Make sure you set the following properties to run a distributed job in your task trackers instead of running a local job:

    SET mapred.job.tracker=[YOUR_JOB_TRACKER]:8021;

    SET hbase.zookeeper.quorum=[ZOOKEEPER_NODE_1],[ZOOKEEPER_NODE_2],[ZOOKEEPER_NODE_3];

  • Reduce the number of columns in your SELECT statement to the minimum. Try not to SELECT *.

  • Whenever you want to use start & stop row keys to prevent full table scans, always provide key >= x and key < y expressions (don't use the BETWEEN operator).

  • Always EXPLAIN SELECT your queries before executing them.
