How do you deal with empty or missing input files in Apache Pig?


Question

Our workflow uses an AWS Elastic MapReduce cluster to run a series of Pig jobs that turn a large amount of data into aggregated reports. Unfortunately, the input data is potentially inconsistent, which can result in either no input files or 0-byte files being given to the pipeline, or even being produced by some stages of the pipeline.

During a LOAD statement, Pig fails spectacularly if it either doesn't find any input files or any of the input files is 0 bytes.

Is there any good way to work around this (hopefully within the Pig configuration or script or the Hadoop cluster configuration, without writing a custom loader...)?

(Since we're using AWS Elastic MapReduce, we're stuck with Pig 0.6.0 and Hadoop 0.20.)

Recommended answer

(For posterity, a sub-par solution we've come up with:)

To deal with the 0-byte problem, we've found that we can detect the situation and substitute a file containing a single newline. This produces a warning like:

Encountered Warning ACCESSING_NON_EXISTENT_FIELD 13 time(s).

but at least Pig doesn't crash with an exception.
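The detect-and-substitute step might be sketched as follows. This is only an illustration under stated assumptions: it works on a local directory, and `patch_empty_files` is a hypothetical helper name, not part of Pig or Hadoop. On a real cluster the same check would have to go through the Hadoop file system tools instead.

```python
import os

def patch_empty_files(input_dir):
    """Overwrite each 0-byte file in input_dir with a single newline.

    Hypothetical helper: assumes a local directory, not HDFS or S3.
    Returns the list of paths that were patched.
    """
    patched = []
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        # Only regular files of size zero are touched.
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            # Binary mode so exactly one "\n" byte is written on any platform.
            with open(path, "wb") as f:
                f.write(b"\n")
            patched.append(path)
    return patched
```

Non-empty files are left alone, so the helper can be run before every job in the pipeline.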

Alternatively, we could produce a line with the appropriate number of '\t' characters for that file, which would avoid the warning, but it would insert garbage into the data that we would then have to filter out.
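A rough sketch of that alternative, assuming tab-delimited data where a record with N fields needs N-1 tabs. The function names are made up for illustration, and in the actual pipeline the filtering would more likely live in the Pig script itself; this only shows the logic.

```python
def placeholder_line(num_fields):
    """Build an all-empty record: num_fields empty fields joined by tabs."""
    return "\t" * (num_fields - 1) + "\n"

def is_placeholder(line):
    """True if the line contains nothing but tabs (every field empty)."""
    return line.rstrip("\n").replace("\t", "") == ""

def drop_placeholders(lines):
    """Filter the garbage placeholder records back out of the data."""
    return [line for line in lines if not is_placeholder(line)]
```

Note that `is_placeholder` also treats a completely blank line as a placeholder, so real records whose fields are all empty would be dropped too.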

These same ideas could be used to solve the no-input-files condition by creating a dummy file, but it has the same downsides as listed above.
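For the no-input-files case, a minimal sketch might look like this. It again assumes a local file system; `ensure_input`, the glob pattern, and the dummy path are all hypothetical. The dummy path should be chosen so that it matches the pattern the LOAD statement uses.

```python
import glob
import os

def ensure_input(pattern, dummy_path, dummy_content="\n"):
    """Create a dummy file if nothing matches pattern.

    Hypothetical helper: returns True if the dummy file was written,
    False if at least one input file already existed.
    """
    if glob.glob(pattern):
        return False
    # Make sure the target directory exists before writing the dummy.
    os.makedirs(os.path.dirname(dummy_path) or ".", exist_ok=True)
    # newline="" keeps the content byte-for-byte as given.
    with open(dummy_path, "w", newline="") as f:
        f.write(dummy_content)
    return True
```

The default content is a single newline; a tab-filled placeholder line could be passed as `dummy_content` instead, with the same trade-offs described above.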
