Why can't Hive support non-equi joins?


Problem description

I found that Hive does not support non-equi joins. Is it simply because it is difficult to translate a non-equi join into MapReduce?

Solution


Yes, the problem lies in the current map-reduce implementation.

How is a common equi-join implemented in MapReduce?

Input records are copied in chunks to the mappers, and the mappers produce output as key-value pairs. These are collected and distributed among the reducers by a partitioning function, in such a way that each reducer processes a whole key; in other words, the mappers create a list of key-value pairs for each reducer, grouped by key. The reducers copy the mapper output and sort it to obtain <key, list of values>. The same is done for both datasets. Each reducer then applies a cross-product to the two lists that share an equal key. That is how the equi-join is implemented. The main idea is that tuples with the same join key are routed to, and processed on, the same reducer instance. This is easy to implement because the key itself determines which reducer will process it (the computation is based on key equality), and each reducer instance receives its own dedicated key list from both datasets; no other reducer works on the same keys.
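The shuffle-based equi-join described above can be sketched in a few lines. This is a minimal single-process simulation with made-up sample data, not Hive's actual implementation: mappers tag each record with its source dataset, the shuffle groups records by join key, and each reducer cross-products the two per-source lists for its key.

```python
from collections import defaultdict

def map_phase(dataset, source):
    # Emit (key, (source, value)) so the reducer can tell A-tuples from B-tuples.
    return [(key, (source, value)) for key, value in dataset]

def shuffle(pairs):
    # Group all tagged values by join key, as the framework's shuffle would.
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups

def reduce_phase(groups):
    # For each key, cross-product the A-values with the B-values.
    joined = []
    for key, tagged in groups.items():
        a_vals = [v for src, v in tagged if src == "A"]
        b_vals = [v for src, v in tagged if src == "B"]
        joined.extend((key, a, b) for a in a_vals for b in b_vals)
    return joined

# Hypothetical sample datasets of (join_key, value) records.
A = [(1, "a1"), (2, "a2")]
B = [(1, "b1"), (2, "b2"), (2, "b2x")]
result = reduce_phase(shuffle(map_phase(A, "A") + map_phase(B, "B")))
# -> [(1, 'a1', 'b1'), (2, 'a2', 'b2'), (2, 'a2', 'b2x')]
```

Note how the key alone decides where each tuple lands, which is exactly the property that breaks down for non-equi conditions.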

Consider a non-equi join: for example, we need to join datasets A and B on the condition A.key <= B.key. In this case a reducer should receive not only the tuples with equal keys from both datasets, but also, for each B.key, all A-tuples whose key is less than B.key. That is difficult to implement with the same key-equality paradigm.

If, for each B.key, a reducer were to receive all A-tuples with A.key <= B.key, that would cause a huge duplication of data on the reducers. For example, if we have A keys (1, 2, 3) and B keys (1, 2, 3), then for B.3 we need [A.1, A.2, A.3], and for B.2 we need [A.1, A.2]. In other words, the mappers must produce a copy of each A-tuple for every matching key, and the lists produced for different keys overlap. The more distinct keys we have, the bigger the duplication becomes.
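The duplication cost is easy to quantify. A small sketch, using the same hypothetical three-key example as above: for each B.key we list the A keys its reducer would need, then count the total number of A-tuple copies shipped across the cluster.

```python
# Replicating A-tuples so that the reducer for each B.key holds every
# A-tuple with A.key <= B.key (the condition from the example above).
a_keys = [1, 2, 3]
b_keys = [1, 2, 3]

# For each B.key, the A keys its reducer must receive.
needed = {b: [a for a in a_keys if a <= b] for b in b_keys}
# needed == {1: [1], 2: [1, 2], 3: [1, 2, 3]}

# Total copies of A-tuples shipped: 1 + 2 + 3 = 6 for only 3 A-tuples.
copies = sum(len(v) for v in needed.values())
```

With n distinct keys this replication grows quadratically (roughly n*(n+1)/2 copies), which is why a naive reduce-side non-equi join does not scale.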

Read this paper for a deep dive into the problem and possible solutions: Processing Theta-Joins using MapReduce

