Hadoop Pig - Removing csv header


Question

My csv files have a header in the first line. Loading them into Pig creates a mess in any subsequent function (like SUM). As of today, I first apply a filter on the loaded data to remove the rows containing the headers:

affaires = load 'affaires.csv' using PigStorage(',') as (NU_AFFA:chararray, date:chararray);
affaires = filter affaires by date matches '../../..';

This feels like a clumsy workaround, and I am wondering whether there is a way to tell Pig not to load the first line of the csv, such as an "as_header" boolean parameter to the load function. I don't see one in the docs. What would be the best practice? How do you usually deal with this?

Solution

The CSVExcelStorage loader supports skipping the header row, so use CSVExcelStorage instead of PigStorage. Download piggybank.jar and try this option.

Example

input.csv

Name,Age,Location
a,10,chennai
b,20,banglore

Pig script (with the SKIP_INPUT_HEADER option):

REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
DUMP A;

Output:

(a,10,chennai)
(b,20,banglore)
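
To tie this back to the SUM problem in the question, here is a minimal sketch that declares a schema on the same load and then sums the Age column. The field names and types in the AS clause, and the aliases grouped and total, are my own assumptions for illustration. They are not part of the original answer, so adjust them to your data.

```pig
REGISTER '/tmp/piggybank.jar';

-- Same loader call as above; the AS clause is an assumed schema for input.csv.
A = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (name:chararray, age:int, location:chararray);

-- Put every row into a single group so SUM runs over the whole relation.
grouped = GROUP A ALL;

-- With the header row skipped, only the data rows (10 and 20) feed the sum.
total = FOREACH grouped GENERATE SUM(A.age);

DUMP total;
```

Because the header line never reaches the relation, there is no need for a filter step like the one in the question before aggregating.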

Reference:
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
