How does Hive/Hadoop ensure that each mapper works on data that is local to it?


Question

    Two basic questions trouble me:

    • How can I be sure that each of the 32 files Hive uses to store my tables sits on its own, distinct machine?
    • If that happens, how can I be sure that when Hive creates 32 mappers, each of them will work on its local data? Does Hadoop/HDFS guarantee this magic, or does Hive, as a smart application, make sure that it happens?

    Background: I have a Hive cluster of 32 machines, and:

    • All my tables are created with "CLUSTERED BY(MY_KEY) INTO 32 BUCKETS"
    • I use hive.enforce.bucketing = true;
    • I verified that every table is indeed stored as 32 files in user/hive/warehouse
    • I'm using an HDFS replication factor of 2

    Thanks!

    Solution

    1. Data placement is determined by HDFS, not by Hive. HDFS tries to balance bytes across machines, so nothing forces the 32 files onto 32 distinct machines. Because of replication, each file (more precisely, each of its blocks) is stored on two machines, which means you have two candidate machines for reading the data locally.
    2. HDFS knows where each file is stored, and Hadoop uses this information to try to place mappers on the same hosts where the data is stored. You can look at the counters for your job to see the "data local" and "rack local" map task counts. This is a feature of Hadoop, not of Hive, and you don't need to worry about it.
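
To make the two points above concrete, here is a minimal sketch of locality-aware placement in Python. It is a toy model, not Hadoop's actual scheduler: the function names `assign_mappers` and `locality_counts`, the host names, and the one-slot-per-host limit are all assumptions made for illustration. It shows that a mapper is placed on a host holding a replica when such a host is free, and that it falls back to a non-local host otherwise, which is why the job counters are worth checking.

```python
def assign_mappers(replica_hosts, all_hosts, slots_per_host=1):
    """Toy locality-aware placement (NOT Hadoop's real scheduler).

    replica_hosts: dict mapping a split name to the hosts holding a replica.
    all_hosts:     list of every host that can run a mapper.
    Returns a dict mapping each split name to (host, is_data_local).
    """
    free = {host: slots_per_host for host in all_hosts}
    placement = {}
    for split, hosts_with_replica in replica_hosts.items():
        # Prefer a host that already stores a replica of this split.
        local = [h for h in hosts_with_replica if free.get(h, 0) > 0]
        if local:
            host, is_local = local[0], True
        else:
            # No replica holder is free: run the mapper elsewhere
            # and read the data over the network.
            candidates = [h for h in all_hosts if free[h] > 0]
            if not candidates:
                raise RuntimeError("no free mapper slots left")
            host, is_local = candidates[0], False
        free[host] -= 1
        placement[split] = (host, is_local)
    return placement


def locality_counts(placement):
    """Summarize a placement the way the job counters summarize a job."""
    local = sum(1 for _, is_local in placement.values() if is_local)
    return {"data_local": local, "non_local": len(placement) - local}


# Hypothetical cluster: three hosts, three bucket files, replication factor 2.
# HDFS happened to put every replica on node1 and node2.
hosts = ["node1", "node2", "node3"]
replicas = {
    "bucket_0": ["node1", "node2"],
    "bucket_1": ["node1", "node2"],
    "bucket_2": ["node1", "node2"],
}

placement = assign_mappers(replicas, hosts)
print(locality_counts(placement))  # {'data_local': 2, 'non_local': 1}
```

In this made-up layout the third mapper cannot run next to its data, because both replica holders are already busy. The real scheduler is more elaborate (it also considers rack locality, for example), but the takeaway is the same: locality is a strong preference rather than a guarantee, and the job counters tell you how often it was achieved.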
