Apache Pig: Load a file that shows fine using hadoop fs -text


Question

I have files that are named part-r-000[0-9][0-9] and that contain tab separated fields. I can view them using hadoop fs -text part-r-00000 but can't get them loaded using pig.

What I've tried:

x = load 'part-r-00000';
dump x;
x = load 'part-r-00000' using TextLoader();
dump x;

but that only gives me garbage. How can I view the file using pig?

What might be of relevance is that my HDFS is still on CDH-2 at the moment. Furthermore, if I download the file and run file part-r-00000 locally, it reports part-r-00000: data, and I don't know how to decompress it.
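Since file only reports "data", one quick way to narrow things down is to inspect the file's magic bytes: Hadoop SequenceFiles begin with the ASCII bytes SEQ, gzip streams with 0x1f 0x8b, and zip archives with PK. A minimal sketch (the helper name is illustrative, not a standard tool):

```shell
# Classify a part file by its first three bytes (magic number).
# SequenceFile -> "SEQ", gzip -> 0x1f 0x8b, zip -> "PK".
detect_format() {
  magic=$(head -c 3 "$1")
  case "$magic" in
    SEQ*)                     echo "sequencefile" ;;
    "$(printf '\037\213')"*)  echo "gzip" ;;
    PK*)                      echo "zip" ;;
    *)                        echo "unknown" ;;
  esac
}

# e.g. detect_format part-r-00000
```

If this prints "sequencefile", the default PigStorage loader will indeed produce garbage, which matches the symptoms described above.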

Answer

According to HDFS Documentation, hadoop fs -text <file> can be used on "zip and TextRecordInputStream" data, so your data may be in one of these formats.

If the file was compressed, Hadoop would normally add the extension when outputting to HDFS, but if it is missing, you could test by ungzipping/unbzip2ing/unzipping/etc. locally. It appears Pig should do this decompressing automatically, but it may require the file extension to be present (e.g. part-r-00000.zip).
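If you suspect compression but the extension is missing, you can probe a local copy with each codec's test mode (gzip -t, bzip2 -t, unzip -t), which checks the format without needing an extension. A rough sketch, assuming those tools are installed:

```shell
# Probe a local copy with common decompressors' test modes.
# Each test mode exits 0 only if the file really is in that format.
probe_compression() {
  if gzip -t "$1" 2>/dev/null; then
    echo gzip
  elif bzip2 -t "$1" 2>/dev/null; then
    echo bzip2
  elif unzip -t "$1" >/dev/null 2>&1; then
    echo zip
  else
    echo "none of gzip/bzip2/zip"
  fi
}

# e.g. probe_compression part-r-00000
```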

I'm not too sure about TextRecordInputStream. It sounds like it would just be Pig's default method, but I could be wrong; a quick Google search didn't turn up any mention of LOADing that data via Pig.

Update: Since you've discovered it is a sequence file, here's how you can load it using PiggyBank:

-- using Cloudera directory structure:
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar
--REGISTER /home/hadoop/lib/pig/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();


-- Sample job: grab counts of tweets by day
-- (not sure if Pig likes the {00..99} glob syntax; part-r-000* also works)
-- Adjust the schema to match your actual key/value types.
A = LOAD 'mydir/part-r-000{00..99}'
    USING SequenceFileLoader AS (key:long, val:long);
