How to force STORE (overwrite) to HDFS in Pig?


Question

When developing Pig scripts that use the STORE command, I have to delete the output directory before every run, or the script stops and reports:

2012-06-19 19:22:49,680 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: Output Location Validation Failed for: 'hdfs://[server]/user/[user]/foo/bar More info to follow:
Output directory hdfs://[server]/user/[user]/foo/bar already exists

So I'm searching for an in-Pig solution to automatically remove the directory, one that also doesn't choke if the directory doesn't exist at call time.
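Outside Pig, this same requirement (a recursive delete that tolerates a missing directory) is what a driver script would implement before launching the job. A minimal local-filesystem sketch of that semantics in Python; the function name `remove_if_exists` and the `foo/bar` path are illustrative, not part of Pig:

```python
import os
import shutil
import tempfile

def remove_if_exists(path):
    # Mirror "rmf" semantics: recursive delete, no error if the path is absent.
    if os.path.exists(path):
        shutil.rmtree(path)

# Demo on a local temp directory standing in for the HDFS output path.
d = os.path.join(tempfile.mkdtemp(), "foo", "bar")
os.makedirs(d)
remove_if_exists(d)  # removes the existing directory
remove_if_exists(d)  # second call is a no-op, does not raise
print(os.path.exists(d))  # False
```

On a real cluster the delete would of course go through the HDFS API or `hadoop fs` rather than `shutil`; the point is only the delete-if-present contract.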

In the Pig Latin Reference I found the shell command invoker fs. Unfortunately, the Pig script breaks whenever anything produces an error, so I can't use

fs -rmr foo/bar

(i.e. remove recursively), since it breaks if the directory doesn't exist. For a moment I thought I might use

fs -test -e foo/bar

which is only a test and shouldn't break, or so I thought. However, Pig again interprets test's nonzero return code on a non-existent directory as a failure code and breaks.
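The behaviour Pig observes matches ordinary shell semantics: a test against a non-existent path exits with a nonzero status, and Pig's fs wrapper treats any nonzero exit as a fatal error. A quick local illustration with POSIX `test` (the path is a made-up example):

```shell
# POSIX test: exit status 0 if the path exists, nonzero otherwise.
if test -e /no/such/path/example; then
  echo "exists"
else
  echo "missing (nonzero exit status)"
fi
```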

There is a JIRA ticket for the Pig project addressing my problem and suggesting an optional OVERWRITE or FORCE_WRITE parameter for the STORE command. Anyway, I'm using Pig 0.8.1 out of necessity, and there is no such parameter.

Answer

At last I found a solution on grokbase. Since finding it took too long, I will reproduce it here and add to it.

Suppose you want to store your output using the statement

STORE Relation INTO 'foo/bar';

Then, in order to delete the directory, you can call at the start of the script

rmf foo/bar

No ";" or quotation marks are required, since it is a shell command.

I cannot reproduce it now, but at some point I got an error message (something about missing files) where I can only assume that rmf interfered with map/reduce. So I recommend putting the call before any relation declaration; after SETs, REGISTERs, and %default statements should be fine.

Example:

SET mapred.fairscheduler.pool 'inhouse';
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
%default name 'foobar'
rmf foo/bar
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar';

