With Apache Pig, how to select and store columns from a CSV according to the header line

Question

I have many CSV files, all with a header line. The files all look similar:

name, gender, preference, ....
peter, m, soap, ...
paul, m, gel, ...
mary, f, soap, ...
.
.
.

But column positions and exact header names can differ slightly, e.g. another file could look like:

"the preferences", "the name", "the gender",....
soap, peter, m, ...
gel, paul, m, ...
soap, mary, f, ...
.
.
.

I want to output/store only the columns whose header contains the word "name". I do not know the position of this column in advance, because it can be different in each file.

So I need to associate the columns in each file with their header names. Can I do this in Pig?

I thought of using two FILTER operators (one for the header, one for the data), but wouldn't the data then have to be read twice?

Recommended answer

It would probably be easier to do this in streaming or in a storage function.

See the implementation of CSVExcelStorage and its SKIP_INPUT_HEADER option: http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java

You could read the header of the file, find the position of the "name" field, and then return only the field at that position for all the other records in the file.
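As a rough illustration of the streaming route, the header-driven selection could be sketched in Python as below. This is only a sketch under assumptions, not the answerer's code: the function name `select_columns` and the `keyword` parameter are hypothetical, the match is a case-insensitive substring test, and the input is assumed to arrive as raw CSV lines with the header first.

```python
import csv
import io


def select_columns(lines, keyword="name"):
    """Yield, for each data row, the fields whose header contains `keyword`.

    `lines` is any iterable of CSV text lines whose first line is the header.
    (Hypothetical helper for illustration; not part of Pig or piggybank.)
    """
    # skipinitialspace lets the reader handle ', "quoted field"' correctly.
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to emit
    # Positions of every header cell that contains the keyword.
    wanted = [i for i, h in enumerate(header) if keyword in h.strip().lower()]
    if not wanted:
        raise ValueError("no header contains %r" % keyword)
    for row in reader:
        if not row:
            continue  # skip blank lines
        yield [row[i].strip() for i in wanted if i < len(row)]


sample = io.StringIO(
    '"the preferences", "the name", "the gender"\n'
    "soap, peter, m\n"
    "gel, paul, m\n"
    "soap, mary, f\n"
)
selected = list(select_columns(sample))
```

For the sample above, `selected` holds one single-element list per data row, containing only the value from the "the name" column. A script used with Pig's STREAM operator would pass `sys.stdin` to the same function and write each result to `sys.stdout`. That assumes each record reaches the script as one unparsed line.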

You should make sure that each split is a single file: if a file is split between tasks, the tasks that work on the parts of the file without the header cannot detect the "name" field.
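The problem with splitting can be seen in a small simulation. The sketch below is purely illustrative: it cuts an in-memory list of lines in two to stand in for two input splits, and `name_positions` is a hypothetical helper, not a Pig API.

```python
import csv


def name_positions(first_line, keyword="name"):
    """Return the column indexes whose cell in `first_line` contains `keyword`."""
    cells = next(csv.reader([first_line], skipinitialspace=True))
    return [i for i, cell in enumerate(cells) if keyword in cell.lower()]


lines = [
    "the preferences, the name, the gender",  # header
    "soap, peter, m",
    "gel, paul, m",
    "soap, mary, f",
]

# Pretend the file was cut into two splits handled by different tasks.
split_a, split_b = lines[:2], lines[2:]

positions_a = name_positions(split_a[0])  # [1]: the real header is present
positions_b = name_positions(split_b[0])  # []: first line is data, no header
```

The task that receives `split_b` treats a data row as the header and finds no "name" column, which is why each file needs to be processed as a single split.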
