hive版本0.13.1中的性能问题 [英] Performance issue in hive version 0.13.1

查看:363
本文介绍了hive版本0.13.1中的性能问题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用 AWS-EMR 运行我的Hive查询,而我在运行hive版本0.13.1时遇到性能问题。



更新版本的配置单元需要大约5分钟来运行10行数据。但230804行的相同脚本需要2天,并且仍在运行。我应该怎么做才能分析和解决问题?



示例数据:



表1:

  hive>描述foo; 
确定
orderno字符串
所用时间:0.101秒,提取:1行




table1的示例数据:



  hive> select * from foo; 
OK
1826203307
1826207803
1826179498
1826179657


$ b $表2:b
$ b

  hive>描述de_geo_ip_logs; 
OK
id bigint
startorderno bigint
endorderno bigint
itemcode int
所用时间:0.047秒,提取:4行




表2的样本数据:



  hive>从栏中选择*; 

127698025 417880320 417880575 306
127698025 3038626048 3038626303 584
127698025 3038626304 3038626431 269
127698025 3038626560 3038626815 163

我的查询:

  SELECT b.itemcode 
FROM foo a,bar b
where a.orderno BETWEEN b.startorderno AND b.endorderno;

解决方案

在Hive日志输出的最顶端,它状态警告:随机加入JOIN [4] [表格a,b]在Stage'Stage-1 Mapred'中是一个交叉产品。

编辑:
'交叉产品'或笛卡尔积是没有条件的连接,它返回'b'表中的每一行,'a'表中的每一行。所以,如果你举一个'a'为5行,'b'为10行的例子,你可以得到产品,或者5乘以10 = 50行返回。对于一个表或其他表,将会有很多行为'null'。



现在,如果您有一个包含20,000行的表a它到500,000行的另一个表b,您要求SQL引擎返回一个数据集a,b10,000,000,000行,然后对这1000万行执行BETWEEN操作。



因此,如果删除'b'行的数量,您会发现您会比'a'获得更多好处 - 在您的示例中,如果可以过滤ip_logs表,表2,因为我猜测它有比你的订单号码表更多的行,它会减少执行时间。
END EDIT



您不强制执行引擎通过笛卡尔产品工作,因为未指定连接的条件。它必须反复扫描所有的表格。 10行,你不会有问题。使用20k,你会遇到几十个map / reduce wave。



试试这个查询:

  SELECT b.itemcode 
FROM foo a JOIN bar b on< SomeKey>
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;

但是我很难弄清楚你的模型允许加入的列。也许这个表达式的数据模型可以改进?这可能只是我没有清楚地阅读示例。



无论采用哪种方式,都需要过滤where子句之前的比较次数。我在Hive中完成的其他方法是使用一组较小的数据创建一个视图,并加入/匹配视图,而不是原始表。


I use AWS-EMR to run my Hive queries and I have a performance issue while running hive version 0.13.1.

The newer version of hive took around 5 minutes for running 10 rows of data. But the same script for 230804 rows is taking 2 days and is still running. What should I do to analyze and fix the problem?

Sample Data:

Table 1:

hive> describe foo;
OK
orderno    string
Time taken: 0.101 seconds, Fetched: 1 row(s)

Sample data for table1:

hive>select * from foo;        
OK
1826203307
1826207803
1826179498
1826179657

Table 2:

hive> describe de_geo_ip_logs;
OK
id          bigint                                      
startorderno        bigint                                      
endorderno          bigint                                      
itemcode                int                                         
Time taken: 0.047 seconds, Fetched: 4 row(s)

Sample data for Table 2:

hive> select * from bar;

127698025   417880320   417880575   306
127698025   3038626048  3038626303  584
127698025   3038626304  3038626431  269
127698025   3038626560  3038626815  163

My Query:

SELECT b.itemcode
FROM foo a,  bar b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;

解决方案

In the very top of your Hive log output, it states "Warning: Shuffle Join JOIN[4][Tables a, b] in Stage 'Stage-1 Mapred' is a cross product."

EDIT: A 'cross product' or Cartesian product is a join without conditions, which returns every row in the 'b' table, for every row in the 'a' table. So, if you take an example of 'a' is 5 rows, and 'b' is 10 rows, you get the product, or, 5 multiplied by 10 = 50 rows returned. There will be a lot of rows that are completely 'null' for one or the other tables.

Now, if you have a table 'a' of 20,000 rows and join it to another table 'b' of 500,000 rows, you are asking the SQL engine to return to you a data set 'a, b' of 10,000,000,000 rows, and then perform the BETWEEN operation on the 10-million rows.

So, if you drop the number of 'b' rows, you see you will get more benefit than the 'a' - in your example, if you can filter the ip_logs table, table 2, since I am making a guess that it has more rows than your order number table, it will cut down on the execution time. END EDIT

You're forcing the execution engine to work through a Cartesian product by not specifying a condition for the join. It's having to scan all of table a over and over. With 10 rows, you will not have a problem. With 20k, you are running into dozens of map/reduce waves.

Try this query:

 SELECT b.itemcode
 FROM foo a JOIN bar b on <SomeKey>
 WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;

But I'm having trouble figuring out what column your model will allow joining on. Maybe the data model for this expression could be improved? It may just be me not reading the sample clearly.

Either way, you need to filter the number of comparisons BEFORE the where clause. Other ways I have done this in Hive is to make a view with a smaller set of data, and join/match the view instead of the original table.

这篇关于hive版本0.13.1中的性能问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆