Load multiple files with PigLatin (Hadoop)

Views: 126 Published: 2018/5/31 19:24:27 Tags: hadoop apache-pig


Problem description

I have a file on HDFS that lists CSV files, all in the same format. I need to LOAD all of them together in Pig. For example:
/path/to/files/2013/01-01/qwe123.csv
/path/to/files/2013/01-01/asd123.csv
/path/to/files/2013/01-01/zxc321.csv
/path/to/files/2013/01-02/ert435.csv
/path/to/files/2013/01-02/fgh987.csv
/path/to/files/2013/01-03/vbn764.csv
They cannot be globbed, because their names are "random" hashes and their directories might contain other CSV files.

Solution
As suggested in other comments, you can do this by pre-processing the file list. Suppose the list is stored on HDFS as file_list.txt; then you can run the following:
pig -param flist=`hdfs dfs -cat file_list.txt | awk 'BEGIN{ORS="";}{if (NR == 1) print; else print ","$0;}'` script.pig
The awk code gets rid of the newline characters and uses commas to separate the file names.
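To check the joining step without touching HDFS, here is a minimal sketch that runs the same awk program on a few sample paths. The function name join_paths and the use of printf as a stand-in for `hdfs dfs -cat file_list.txt` are assumptions made for illustration only.

```shell
# join_paths (hypothetical helper): reads one path per line on stdin and
# prints them joined by commas, with no trailing newline.
join_paths() {
  awk 'BEGIN { ORS = "" } { if (NR == 1) print; else print "," $0 }'
}

# printf stands in for `hdfs dfs -cat file_list.txt` in this local test.
flist=$(printf '%s\n' \
  /path/to/files/2013/01-01/qwe123.csv \
  /path/to/files/2013/01-01/asd123.csv \
  /path/to/files/2013/01-02/ert435.csv | join_paths)

# Show the single comma-separated string that would be passed via -param.
echo "$flist"
```

If any path could contain spaces, quoting the whole parameter (for example "flist=$flist") when calling pig should keep the shell from splitting it.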


In your script (called script.pig in my example), you should use parameter substitution to load the data:
data = LOAD '$flist';
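To see what Pig ends up with after substitution, the sketch below imitates the replacement with sed. This is only a stand-in, not Pig's own mechanism, and the two sample paths are placeholders. With Pig installed, `pig -dryrun` should produce the substituted script without executing it.

```shell
# Sample comma-joined list, as the awk step would produce it.
flist='/path/to/files/2013/01-01/qwe123.csv,/path/to/files/2013/01-02/ert435.csv'

# The script line from the answer; the quoting keeps $flist literal here.
script_line='data = LOAD '\''$flist'\'';'

# Stand-in for parameter substitution: replace the literal text $flist
# with the value. Assumes the paths contain no "|" or "&" characters.
expanded=$(printf '%s\n' "$script_line" | sed "s|\\\$flist|$flist|")

echo "$expanded"
```

The result is a single LOAD statement with a comma-separated list of paths, which LOAD accepts. Since the inputs are CSV, adding USING PigStorage(',') to the statement is probably needed to split fields on commas, because the default delimiter is a tab.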


                        