Hive overwrite directory move process as distcp?


Question

When I run an INSERT OVERWRITE DIRECTORY query in Hive, it seems to store the results in a .hivexxxx staging folder and then move the files from there to the target directory.

At the end of the MapReduce process, it shows this:

Moving data to: hdfs://nameservice1/user/events/Click2/.hive-staging_hive_2015-11-21_08-32-49_909_6034680686432863037-1/-ext-10000
Moving data to: /user/events/Click2

This is slow, and it does not seem to use distcp.

Is there a way to set Hive to use distcp during that process, or a way to set it so that it doesn't put data into that staging folder? I don't see the point of that staging folder.

Recommended answer

Unless you're using HDFS federation and have configured Hive to put the .staging* dir for a job on a different filesystem/namespace than the destination dir (which is very unlikely with the default settings), you probably don't want Hive to use distcp. What Hive is doing now is copying all the output files from the .staging dir to the final destination dir. Using distcp would do the same thing, copying, plus add the overhead of spawning a whole MapReduce job for every file (that's the behavior I've seen in Hive 1.1), so performance will likely be much worse. The only possible exception is if your output files are insanely large.

But why copy if you don't have to? Copying means reading and rewriting every file, whereas an HDFS move/rename only changes the files' metadata and is nearly instant.
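To make the "same filesystem" condition concrete, here is a minimal sketch of how you could check whether two paths share a scheme and authority, which is the precondition for a metadata-only rename. This is an illustration only, not Hive's actual code; the function name same_filesystem and the default_fs parameter are hypothetical, and real behavior can depend on other factors this sketch ignores.

```python
from urllib.parse import urlparse


def same_filesystem(src, dst, default_fs="hdfs://nameservice1"):
    """Return True when both paths resolve to the same scheme and authority.

    Paths without a scheme (e.g. /user/events/Click2) are assumed to live
    on default_fs, mirroring how an unqualified HDFS path is resolved.
    """
    def fs_id(path):
        parsed = urlparse(path)
        if not parsed.scheme:
            parsed = urlparse(default_fs)
        return (parsed.scheme, parsed.netloc)

    return fs_id(src) == fs_id(dst)


staging = ("hdfs://nameservice1/user/events/Click2/"
           ".hive-staging_hive_2015-11-21_08-32-49_909_6034680686432863037-1/-ext-10000")
print(same_filesystem(staging, "/user/events/Click2"))  # True
```

With the paths from the question, both sides resolve to hdfs://nameservice1, so a rename is at least possible in principle and no cross-filesystem copy is required.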

To get the rename behavior, I recommend adding the following (unfortunately undocumented) property to your hive-site.xml:

<property>
    <name>hive.exec.stagingdir</name>
    <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
    <description>
      In Hive 0.14 and later, set to ${hive.exec.scratchdir}/${user.name}/.staging
      In Hive versions before 0.14, set to ${hive.exec.scratchdir}/.staging

      You may need to manually create and/or set appropriate permissions on
      the parent dirs ahead of time.
    </description>
</property>

If ${hive.exec.scratchdir} is not substituted automatically in your version of Hive, look up its value and substitute it manually in the value above. For example, with the default value of hive.exec.scratchdir, you would set this value to /tmp/hive/${user.name}/.staging in Hive 0.14 and later, and to /tmp/hive-${user.name}/.staging in versions before 0.14. (You shouldn't have to do this with ${user.name}, and it's not a good idea to do so, for reasons that are off-topic for this answer.)
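If you want to double-check the manual substitution, a small sketch like the following reproduces it. The helper name substitute is hypothetical and has nothing to do with Hive's own variable expansion; it only replaces the references you pass in and leaves ${user.name} alone, as recommended above. The scratchdir values are the defaults mentioned in this answer.

```python
import re


def substitute(value, props):
    """Replace each ${name} that appears in props; leave other ${...} as-is."""
    return re.sub(
        r"\$\{([^}]+)\}",
        lambda m: props.get(m.group(1), m.group(0)),
        value,
    )


# Hive 0.14 and later: default hive.exec.scratchdir is /tmp/hive
new_style = substitute(
    "${hive.exec.scratchdir}/${user.name}/.staging",
    {"hive.exec.scratchdir": "/tmp/hive"},
)
print(new_style)  # /tmp/hive/${user.name}/.staging

# Before Hive 0.14: default hive.exec.scratchdir is /tmp/hive-${user.name}
old_style = substitute(
    "${hive.exec.scratchdir}/.staging",
    {"hive.exec.scratchdir": "/tmp/hive-${user.name}"},
)
print(old_style)  # /tmp/hive-${user.name}/.staging
```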
