Load multiple files with PigLatin (Hadoop)
Question
I have an HDFS file list of csv files with the same format. I need to be able to LOAD them together with Pig, e.g.:
/path/to/files/2013/01-01/qwe123.csv
/path/to/files/2013/01-01/asd123.csv
/path/to/files/2013/01-01/zxc321.csv
/path/to/files/2013/01-02/ert435.csv
/path/to/files/2013/01-02/fgh987.csv
/path/to/files/2013/01-03/vbn764.csv
They cannot be globbed, as their names are "random" hashes and their directories might contain more csv files.
Answer
As suggested in other comments, you can do this by pre-processing the file list. Suppose your HDFS file is called file_list.txt; then you can do the following:
pig -param flist=`hdfs dfs -cat file_list.txt | awk 'BEGIN{ORS="";}{if (NR == 1) print; else print ","$0;}'` script.pig
The awk code gets rid of the newline characters and uses commas to separate the file names.
In your script (called script.pig in my example), you should use parameter substitution to load the data:
data = LOAD '$flist';
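For context, a fuller script.pig might look like the sketch below. This is a hypothetical example: the column names and types are made-up placeholders, and since Pig's LOAD defaults to tab-delimited parsing, comma-separated csv data usually needs an explicit `USING PigStorage(',')`:

```pig
-- Hypothetical sketch of script.pig. $flist expands to the
-- comma-separated file list built by the awk command; Pig's LOAD
-- accepts multiple comma-separated paths. The schema here is a
-- placeholder -- adjust it to match your actual csv columns.
data = LOAD '$flist' USING PigStorage(',') AS (id:chararray, value:int);
DUMP data;
```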