Hadoop Pig - Removing csv header
Problem description
My CSV files have a header in the first line. Loading them into Pig creates a mess for any subsequent functions (like SUM). As of today, I first apply a filter to the loaded data to remove the rows containing the headers:
affaires = load 'affaires.csv' using PigStorage(',') as (NU_AFFA:chararray, date:chararray) ;
affaires = filter affaires by date matches '../../..';
I think this is a rather clumsy method, and I am wondering whether there is a way to tell Pig not to load the first line of the CSV, such as an "as_header" boolean parameter to the load function. I don't see one in the documentation. What would be the best practice? How do you usually deal with this?
The CSVExcelStorage loader supports skipping the header row, so use CSVExcelStorage instead of PigStorage. Download piggybank.jar and try this option.
Example
input.csv:
Name,Age,Location
a,10,chennai
b,20,banglore
Pig script (with the SKIP_INPUT_HEADER option):
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
DUMP A;
Output:
(a,10,chennai)
(b,20,banglore)
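Since the question mentions SUM, here is a minimal sketch of how the header-free load could feed an aggregate. It assumes the same piggybank.jar path and input.csv as above; the relation names (people, all_people, total_age) and the schema in the AS clause are illustrative and not part of the original answer.

```pig
-- Assumes piggybank.jar is at the same path used in the example above.
REGISTER '/tmp/piggybank.jar';

-- Load the file without its header row and attach an (assumed) schema,
-- so that the Age column is typed as int instead of being read as text.
people = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (name:chararray, age:int, location:chararray);

-- Put every row into a single group so SUM runs over the whole relation.
all_people = GROUP people ALL;
total_age = FOREACH all_people GENERATE SUM(people.age);

DUMP total_age;
```

With the header row skipped at load time, no header string reaches the age column, so the separate filter step from the question should not be needed.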
Reference:
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html