Custom Mapper and Reducer vs HiveQL

Question

Problem statement:

I need to compare two tables, Table1 and Table2, which both store the same kind of data. Table1 is the main table, so Table2 has to be compared against it, and after the comparison I need to produce a report of whatever discrepancies Table2 has. Both tables hold a lot of data, on the order of terabytes. So far I have written the comparison in HiveQL to get the data back.

So my question is which is better in terms of performance: writing a custom mapper and reducer for this kind of job, or keeping the HiveQL I wrote, given that I will be joining these two tables on millions of records? As far as I know, HiveQL internally (behind the scenes) generates an optimized map-reduce job, submits it for execution, and returns the results.

Recommended answer

There are two parts to the answer to your question.

Firstly, if the processing can be expressed in Hive QL syntax, I would argue that Hive's performance is comparable to that of a custom map-reduce job. The only catch is when you have extra information about your data that your map-reduce code uses but Hive does not. For example, if your data is sorted, you can exploit that when processing file splits in the mapper, whereas Hive cannot use the sort order to its advantage unless it is made aware of it. Often there is a way to specify such extra information (through metadata or config properties), but sometimes there may be no way to specify it for Hive at all.
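To make the sorted-data point concrete, here is a minimal Python sketch of the kind of shortcut custom code can take when it knows both inputs are already sorted by key: a single merge pass with no shuffle or re-sort. The two-column `(key, value)` layout, the unique-key assumption, the function name `diff_sorted`, and the status labels are all made up for illustration; none of them come from the question.

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, str]  # (key, value): hypothetical two-column layout


def diff_sorted(main: Iterable[Row], other: Iterable[Row]) -> Iterator[Tuple[str, str]]:
    """Yield (key, status) for every key where `other` disagrees with `main`.

    Assumes both inputs are sorted by key and that keys are unique per input.
    """
    main_it, other_it = iter(main), iter(other)
    m = next(main_it, None)
    o = next(other_it, None)
    while m is not None or o is not None:
        if o is None or (m is not None and m[0] < o[0]):
            # Key exists only in the main table.
            yield (m[0], "missing_in_table2")
            m = next(main_it, None)
        elif m is None or o[0] < m[0]:
            # Key exists only in the table being checked.
            yield (o[0], "extra_in_table2")
            o = next(other_it, None)
        else:
            # Same key on both sides: compare the payloads.
            if m[1] != o[1]:
                yield (m[0], "value_mismatch")
            m = next(main_it, None)
            o = next(other_it, None)
```

Each row is read exactly once and nothing is buffered, which is the advantage that is lost if the engine does not know about the sort order and has to shuffle and sort the data again.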

Secondly, sometimes the processing is convoluted enough that it cannot easily be expressed in SQL-like statements. These cases typically involve having to keep intermediate state during processing. Hive UDAFs alleviate this problem to some extent. However, if you need something more custom, I have always preferred plugging in a custom mapper and/or reducer using the Hive Transform functionality. It lets you take advantage of map-reduce within the context of a Hive query, so you can mix and match Hive's SQL-like functionality with custom map-reduce scripts, all in the same query.
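As a rough illustration of the Transform approach, the sketch below is a Python streaming reducer of the sort that could be plugged into a Hive query. Hive passes rows to such a script on stdin as tab-separated lines and reads tab-separated lines back from stdout. Everything specific here is an assumption for the example: the three-column input `key<TAB>source<TAB>value`, the source tags `t1` and `t2`, at most one row per key per source, and the status labels.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def compare_groups(lines: Iterable[str]) -> Iterator[str]:
    """Turn 'key<TAB>source<TAB>value' lines into 'key<TAB>status' lines.

    Assumes rows for the same key arrive next to each other (for example
    because the feeding subquery ends in CLUSTER BY on the key) and that
    `source` is the hypothetical tag 't1' (main table) or 't2'.
    """
    rows = (line.rstrip("\n").split("\t", 2) for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda r: r[0]):
        # Per-key state: which sources were seen and with what value.
        seen = {source: value for _, source, value in group}
        if "t1" not in seen:
            status = "extra_in_table2"
        elif "t2" not in seen:
            status = "missing_in_table2"
        elif seen["t1"] != seen["t2"]:
            status = "value_mismatch"
        else:
            continue  # rows agree, nothing to report
        yield f"{key}\t{status}"


def main() -> None:
    # Stream stdin to stdout, one discrepancy per output line.
    for out in compare_groups(sys.stdin):
        print(out)
```

To use it from Hive, one would add the usual `if __name__ == "__main__": main()` guard at the bottom, ship the file with `ADD FILE`, and call it with `SELECT TRANSFORM(...) USING 'python compare_reducer.py' AS (...)` over a subquery that tags each row with its source table and ends in `CLUSTER BY` on the key. The file name `compare_reducer.py` is hypothetical.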

Long story short: if your processing is easily expressible as a Hive QL query, I don't see much reason to write map-reduce code to achieve the same thing. One of the main reasons Hive was created was to let people like us write SQL-like queries instead of map-reduce. If we end up writing map-reduce instead of plain Hive queries (for performance reasons or otherwise), one could argue that Hive hasn't done a good job at its primary objective. On the other hand, if you have some information about your data that Hive can't take advantage of, you might be better off writing a custom map-reduce implementation that makes use of it. But then again, there is no need to write an entire map-reduce program when you can simply plug in the mappers and reducers using the Hive Transform functionality mentioned above.
