Is a collocated join (a-la-netezza) theoretically possible in hive?


Problem Description

When you join tables that are distributed on the same key and use those key columns in the join condition, each SPU (machine) in Netezza works 100% independently of the others (see nz-interview).

In Hive, there's the bucketed map join, but distributing the files that represent the tables to the datanodes is the responsibility of HDFS; it is not done according to Hive's CLUSTERED BY key!

So suppose I have two tables, CLUSTERED BY the same key, and I join on that key: can Hive get a guarantee from HDFS that matching buckets will sit on the same node? Or will it always have to move the matching bucket of the small table to the datanode containing the big table's bucket?
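
For concreteness, here is roughly the setup I mean (just a sketch; the table names, columns, and bucket count are made up):

    -- Two hypothetical tables bucketed on the same join key.
    -- For a bucketed map join, the bucket counts must be equal (or multiples).
    CREATE TABLE big_table (
      user_id BIGINT,
      payload STRING
    )
    CLUSTERED BY (user_id) INTO 32 BUCKETS;

    CREATE TABLE small_table (
      user_id BIGINT,
      attr STRING
    )
    CLUSTERED BY (user_id) INTO 32 BUCKETS;

    -- Make INSERTs honor the bucketing, and enable the bucketed map join.
    SET hive.enforce.bucketing = true;
    SET hive.optimize.bucketmapjoin = true;

    SELECT /*+ MAPJOIN(s) */ b.user_id, b.payload, s.attr
    FROM big_table b
    JOIN small_table s ON b.user_id = s.user_id;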

Thanks, ido

(Note: this is a better phrasing of my previous question: How does hive/hadoop assures that each mapper works on data that is local for it?)

Solution

I don't think it is possible to tell HDFS where to store blocks of data.
One trick I can suggest, which will do for small clusters, is to increase the replication factor of one of the tables to a number close or equal to the number of nodes in the cluster.
As a result, during the join the appropriate data will almost always (or always) be present on the required node.
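
A minimal sketch of that trick from the Hive CLI, assuming a 10-node cluster and a table stored at the default warehouse path (both the node count and the path are assumptions):

    -- Replicate the small table's files to ~every datanode, so the join
    -- side is (almost) always local; -w waits for replication to finish.
    dfs -setrep -w 10 /user/hive/warehouse/small_table;

    -- Alternatively, set the factor before (re)writing the table so that
    -- its files are created with that many replicas.
    SET dfs.replication=10;

The block placement itself is still up to HDFS; the extra replicas just make it likely that a local copy exists wherever the mapper for a given bucket of the big table happens to run.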
