Custom Mapper and Reducer vs HiveQL

Views: 90 Published: 2018/5/31 19:01:32 Tags: performance hadoop mapreduce hive hiveql

Problem Statement:

I need to compare two tables, Table1 and Table2, which both store the same thing. Table1 is the main table, so Table2 has to be compared against it, and after the comparison I need to produce a report of any discrepancies found in Table2. Both tables hold a lot of data, on the order of terabytes. So far I have written HiveQL to do the comparison and get the data back.

So my question is which is better in terms of performance: writing a custom mapper and reducer for this kind of job, or is the HiveQL I wrote fine, given that I will be joining these two tables on millions of records? As far as I know, HiveQL internally (behind the scenes) generates optimized map-reduce jobs, submits them for execution, and returns the results.

Solution

The answer to your question is two-fold.

Firstly, if your processing can be expressed in Hive QL syntax, I would argue that Hive's performance is comparable to that of custom map-reduce code. The only catch is when you have extra information about your data that you can exploit in your map-reduce code but not through Hive. For example, if your data is sorted, you can make use of that when processing your file splits in the mapper, whereas Hive cannot take advantage of the sort order unless it is made aware of it. Often there is a way to specify such extra information (through metadata or config properties), but sometimes there may be no way at all to pass it to Hive.
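To make the sorted-data point concrete, here is a minimal sketch of what custom code can do when both inputs are known to be sorted by key: a single merge pass that never has to shuffle or re-sort. The row shape `(key, value)`, the function name `diff_sorted`, and the reason labels are assumptions for illustration, and the sketch assumes keys are unique within each input.

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, str]  # hypothetical row shape: (key, value)


def diff_sorted(main: Iterable[Row], other: Iterable[Row]) -> Iterator[Tuple[str, str]]:
    """Yield (key, reason) for each discrepancy between two key-sorted inputs.

    `main` plays the role of Table1 and `other` the role of Table2.
    Both inputs must already be sorted by key, with unique keys.
    """
    it_main, it_other = iter(main), iter(other)
    a = next(it_main, None)
    b = next(it_other, None)
    while a is not None or b is not None:
        if b is None or (a is not None and a[0] < b[0]):
            # Key exists only in the main table.
            yield (a[0], "missing_in_table2")
            a = next(it_main, None)
        elif a is None or b[0] < a[0]:
            # Key exists only in the table being checked.
            yield (b[0], "extra_in_table2")
            b = next(it_other, None)
        else:
            # Same key on both sides: compare the payloads.
            if a[1] != b[1]:
                yield (a[0], "value_mismatch")
            a = next(it_main, None)
            b = next(it_other, None)
```

Each input is read exactly once and only one row per side is held in memory, which is the kind of saving a general-purpose join cannot make unless the engine knows about the ordering.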

Secondly, the processing can sometimes be convoluted enough that it is not easily expressible in SQL-like statements. These cases typically involve having to keep intermediate state during processing. Hive UDAFs alleviate this problem to some extent. However, if you need something more custom, I have always preferred plugging in a custom mapper and/or reducer using the Hive Transform functionality. It lets you take advantage of map-reduce within the context of a Hive query, so you can mix and match Hive's SQL-like functionality with custom map-reduce scripts, all in the same query.
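As a rough illustration of the Transform approach, the script below is a sketch of a reducer that Hive could stream rows through. Hive passes rows to a transform script as tab-separated lines on stdin and reads tab-separated lines back from stdout. The column layout (`key`, `source`, `value`), the source tags `t1`/`t2`, the script name, and the query in the docstring are all assumptions made for this example, not something taken from the question.

```python
"""Hypothetical reducer for a Hive TRANSFORM query, roughly like:

    ADD FILE compare_reducer.py;
    FROM (
      SELECT k, src, v FROM tagged_rows CLUSTER BY k
    ) t
    SELECT TRANSFORM (k, src, v)
    USING 'python compare_reducer.py'
    AS (k STRING, reason STRING);

`tagged_rows` is an assumed view that unions both tables and tags each
row with 't1' (Table1) or 't2' (Table2) in the `src` column.
"""
import sys
from itertools import groupby
from typing import Iterable, Iterator


def find_discrepancies(lines: Iterable[str]) -> Iterator[str]:
    """Group consecutive rows by key and emit one line per discrepancy."""
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # CLUSTER BY delivers all rows for a key together, so groupby suffices.
    for key, group in groupby(rows, key=lambda r: r[0]):
        seen = {}
        for _, source, value in group:
            seen[source] = value
        if "t2" not in seen:
            yield f"{key}\tmissing_in_table2"
        elif "t1" not in seen:
            yield f"{key}\textra_in_table2"
        elif seen["t1"] != seen["t2"]:
            yield f"{key}\tvalue_mismatch"


if __name__ == "__main__":
    for out in find_discrepancies(sys.stdin):
        print(out)
```

Keeping the logic in a function that takes any iterable of lines means it can be tried locally on a small sample file before it is wired into a query.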

Long story short: if your processing is easily expressible through a Hive QL query, I don't see much reason to write map-reduce code to achieve the same thing. One of the main reasons Hive was created was to let people like us write SQL-like queries instead of map-reduce. If we end up writing map-reduce instead of quintessential Hive queries (for performance reasons or otherwise), one could argue that Hive hasn't done a good job at its primary objective. On the other hand, if you have some information about your data that Hive can't take advantage of, you might be better off writing a custom map-reduce implementation that makes use of that information. But then again, there is no need to write an entire map-reduce program when you can simply plug in the mappers and reducers using the Hive Transform functionality mentioned above.
