Why does a Fetch task in Hive run faster than a Map-only task?
Question
For simple queries, Hive can run a Fetch task instead of a Map or MapReduce job. This is enabled with the hive.fetch.task.conversion parameter.
Please explain why a Fetch task runs so much faster than a Map-only task, especially for simple work such as select * from table limit 10;. What additional work does the Map-only task do in this case? In my case the Fetch task is more than 20 times faster. Both tasks have to read the table data, don't they?
Recommended answer
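A FetchTask fetches the data directly: it reads the table's files in the Hive client (or HiveServer2) process itself. A Map-only query is instead submitted as a MapReduce job, which has to be scheduled and have its containers started before any row is read. For a query as small as select * from table limit 10; that fixed startup cost dominates the run time.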
<property>
<name>hive.fetch.task.conversion</name>
<value>minimal</value>
<description>
Some select queries can be converted to a single FETCH task,
minimizing latency. Currently the query should be single
sourced, not having any subquery, and should not have
any aggregations or distincts (which incur RS),
lateral views or joins.
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
</description>
</property>
There is also another parameter, hive.fetch.task.conversion.threshold, whose default is -1 up to Hive 0.13 and 1 GB (1073741824 bytes) from 0.14 onward. With the 1 GB default, Hive uses MapReduce instead of a Fetch task when the table data to be read is larger than 1 GB.
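The two parameters can also be set per session and the effect checked with EXPLAIN. The following is a minimal sketch, not a tuned configuration: my_table is a hypothetical table name, and the none value is an assumption about newer Hive releases (it is not listed in the description quoted above).

```sql
-- Allow simple SELECT / FILTER / LIMIT queries to run as a Fetch task.
SET hive.fetch.task.conversion=more;

-- Convert only when the input is at most 1 GB (the value is in bytes).
SET hive.fetch.task.conversion.threshold=1073741824;

-- The plan should contain a Fetch Operator stage and no Map Reduce stage.
EXPLAIN SELECT * FROM my_table LIMIT 10;

-- For comparison: turn the conversion off (assumes a Hive version that
-- accepts 'none'), so the same query is planned as a Map-only job.
SET hive.fetch.task.conversion=none;
EXPLAIN SELECT * FROM my_table LIMIT 10;
```

Comparing the two plans, and then timing the query under each setting, is a simple way to see how much of the run time is job startup rather than reading data.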