Add multiple files to distributed cache in HIVE


Question



I currently have an issue adding a folder's contents to Hive's distributed cache. I can successfully add multiple files to the distributed cache in Hive using:

ADD FILE /folder/file1.ext;
ADD FILE /folder/file2.ext;
ADD FILE /folder/file3.ext;
etc.


I also see that there is an ADD FILES (plural) option which, in my mind, means you could specify a directory like ADD FILES /folder/; and everything in the folder would get included (this works with the Hadoop Streaming -files option). But this does not work with Hive. Right now I have to add each file explicitly.

Am I doing this wrong? Is there a way to add a whole folder's contents to the distributed cache?

P.S. I tried wildcards ADD FILE /folder/* and ADD FILES /folder/*, but those fail too.

Edit:

As of Hive 0.11 this is now supported, so:

ADD FILE /folder

now works.

What I am using is passing the folder location to the Hive script as a parameter, so:

$ hive -f my-query.hql -hiveconf folder=/folder

and in the my-query.hql file:

ADD FILE ${hiveconf:folder}

Nice and tidy now!
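That flow can be sketched end to end as below; the file path is an illustrative choice, and the hive invocation itself is shown as a comment since it needs a Hive installation:

```shell
# Write the one-line HQL file from above to an illustrative path
cat > /tmp/my-query.hql <<'EOF'
ADD FILE ${hiveconf:folder};
EOF

# Then run it, substituting the folder at invocation time:
#   hive -f /tmp/my-query.hql -hiveconf folder=/folder
```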

Solution

ADD FILE doesn't support directories, but as a workaround you can zip the files and then add the archive to the distributed cache (ADD ARCHIVE my.zip). When the job is running, the contents of the archive are unpacked into the local job directory of the slave nodes (see the mapred.job.classpath.archives property).

If the number of files you want to pass is relatively small and you don't want to deal with archives, you can also write a small script that prepares the ADD FILE commands for all the files in a given directory.
E.g.:

#!/bin/bash
# list.sh -- print an ADD FILE command for every file in the given directory

if [ ! "$1" ]
then
  echo "Directory is missing!"
  exit 1
fi

ls -d "$1"/* | while read -r f; do echo "ADD FILE $f;"; done

Then invoke it from the Hive shell and execute the generated output:

!/home/user/list.sh /path/to/files
