Hadoop Pig - Removing csv header

Published: 2021/11/12 4:07:10. Tags: csv, hadoop, apache-pig

Question

My csv files have a header in the first line. Loading them into Pig makes a mess of any subsequent function (like SUM). As of today, I first apply a filter on the loaded data to remove the rows containing the headers:

affaires = LOAD 'affaires.csv' USING PigStorage(',') AS (NU_AFFA:chararray, date:chararray);
affaires = FILTER affaires BY date MATCHES '../../..';

I think this is a bit stupid as a method, and I am wondering whether there is a way to tell Pig not to load the first line of the csv, such as an "as_header" boolean parameter to the load function. I don't see one in the docs. What would be a best practice? How do you usually deal with this?

Recommended answer

The CSVExcelStorage loader supports skipping the header row, so use CSVExcelStorage instead of PigStorage. Download piggybank.jar and try this option.
Example
input.csv 
Name,Age,Location
a,10,chennai
b,20,banglore

PigScript: (with the SKIP_INPUT_HEADER option)
REGISTER '/tmp/piggybank.jar';
A  = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
DUMP A;

Output:
(a,10,chennai)
(b,20,banglore)
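
Applied to the file from the question, the same loader might look like the sketch below. It assumes piggybank.jar sits at /tmp/piggybank.jar as in the example above, and it reuses the schema from the question. It has not been run against the asker's data, so treat it as a starting point rather than a verified script.

```pig
-- Assumption: piggybank.jar was downloaded to /tmp, as in the example above.
REGISTER '/tmp/piggybank.jar';

-- Load affaires.csv and let the loader drop the header row,
-- so no date-pattern filter is needed afterwards.
affaires = LOAD 'affaires.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (NU_AFFA:chararray, date:chararray);

DUMP affaires;
```

With the header skipped at load time, the FILTER step on the date pattern from the question should no longer be needed for removing the header.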

Reference:
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
