Apache Pig: Load a file that shows fine using hadoop fs -text


Question

I have files named part-r-000[0-9][0-9] that contain tab-separated fields. I can view them using hadoop fs -text part-r-00000, but I can't load them using Pig.

What I've tried:

x = load 'part-r-00000';
dump x;
x = load 'part-r-00000' using TextLoader();
dump x;

but that only gives me garbage. How can I view the file using pig?

What might be relevant is that my HDFS is still on CDH-2 at the moment. Furthermore, if I download the file locally and run file part-r-00000, it reports part-r-00000: data, and I don't know how to unzip it locally.

Solution

According to HDFS Documentation, hadoop fs -text <file> can be used on "zip and TextRecordInputStream" data, so your data may be in one of these formats.

If the file were compressed, Hadoop would normally add the extension when writing to HDFS. If the extension is missing, you could test by trying to unzip/gunzip/bunzip2 the file locally. Pig appears to decompress input automatically, but it seems to decide this from the file extension, so the extension may need to be present (e.g. part-r-00000.gz).
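Since file(1) only reports "data", a quicker local test than trying each decompressor is to look at the first few bytes of the downloaded file. The sketch below is illustrative only: the function names sniff_format and sniff_file are hypothetical, and it checks just the well-known leading signatures of gzip, bzip2, zip, and Hadoop SequenceFiles (which begin with the ASCII bytes "SEQ").

```python
def sniff_format(header: bytes) -> str:
    """Guess a container format from the first bytes of a file.

    Only a few well-known leading signatures are checked; anything
    else is reported as "unknown".
    """
    if header.startswith(b"SEQ"):
        # Hadoop SequenceFiles start with "SEQ" followed by a version byte.
        return "hadoop-sequence-file"
    if header.startswith(b"\x1f\x8b"):
        return "gzip"
    if header.startswith(b"BZh"):
        return "bzip2"
    if header.startswith(b"PK\x03\x04"):
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first four bytes of a local file and guess its format."""
    with open(path, "rb") as f:
        return sniff_format(f.read(4))
```

For a file copied out of HDFS, calling sniff_file("part-r-00000") would return "hadoop-sequence-file" if it is a sequence file, in which case renaming it with a compression extension will not help and a sequence-file loader is needed.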

I'm not too sure about TextRecordInputStream. It sounds like it would just be Pig's default method, but I could be wrong. I didn't see any mention of LOAD'ing this kind of data via Pig when I did a quick Google search.

Update: Since you've discovered it is a sequence file, here's how you can load it using PiggyBank:

-- using Cloudera directory structure:
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
--REGISTER /home/hadoop/lib/pig/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();

-- Sample job: grab counts of tweets by day
-- [0-9][0-9] matches part-r-00000 through part-r-00099; bash-style {00..99}
-- ranges are not part of Hadoop's glob syntax.
-- The key/value types are placeholders; adjust them to match the file.
A = LOAD 'mydir/part-r-000[0-9][0-9]'
    USING SequenceFileLoader AS (key:long, val:long);
