With Apache Pig, how to select and store columns from a CSV according to the header line


Question

I have many CSV files, all with a header line. The files all look similar:

name, gender, preference, ....
peter, m, soap, ...
paul, m, gel, ...
mary, f, soap, ...
.
.
.

But the column positions and exact header names can differ slightly, e.g. another file could look like:

"the preferences", "the name", "the gender",....
soap, peter, m, ...
gel, paul, m, ...
soap, mary, f, ...
.
.
.

I want to output/store only the columns for which the header contains the word "name". I do not know the position of this column in advance, because each file can be different.

So, I need to associate the columns in each file with their header names. Can I do this in Pig?

I thought of using two FILTER operators (one for the header, one for the data), but wouldn't the data then have to be read twice?

Recommended answer

It would probably be easier to do this in streaming or in a storage function.

See the implementation of CSVExcelStorage and SKIP_INPUT_HEADER - http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java

You could read the header of the file, find the location of the "name" field and then only return the field in that location for all the other records in the file.

You should make sure that each split is a single file, because if a file is split between tasks, the tasks that work on the parts of the file that don't contain the header would not be able to detect the "name" field.
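
As a rough illustration of the streaming route, here is a minimal Python sketch of a script that reads CSV lines, finds every column whose header contains "name", and emits only those columns. The function name `select_columns`, the case-insensitive substring match, and the handling of spaces after commas are assumptions made for this example, not something taken from the answer or from piggybank. It also assumes the header is the first line the script sees, which is exactly the one-file-per-split condition described above.

```python
import csv
import sys


def select_columns(lines, keyword="name"):
    """Yield each data row reduced to the columns whose header contains `keyword`.

    `lines` is any iterable of CSV text lines; the first line is treated
    as the header (assumption: the whole file arrives in one stream).
    """
    # skipinitialspace lets headers like `"the name", "the gender"` parse cleanly
    reader = csv.reader(lines, skipinitialspace=True)
    try:
        header = next(reader)
    except StopIteration:
        return  # empty input: nothing to emit

    # Positions of all header cells that mention the keyword (case-insensitive)
    keep = [i for i, cell in enumerate(header) if keyword in cell.strip().lower()]

    for row in reader:
        # Guard against short or blank rows instead of raising IndexError
        yield [row[i].strip() for i in keep if i < len(row)]


if __name__ == "__main__":
    # When used as a streaming command: stdin in, selected columns out
    writer = csv.writer(sys.stdout, lineterminator="\n")
    for selected in select_columns(sys.stdin):
        writer.writerow(selected)
```

Saved under a hypothetical name such as `select_name_columns.py`, a script like this could be invoked from Pig through the `STREAM ... THROUGH` operator, with the input loaded as raw lines. How the script is shipped to the cluster and how splits are kept to one file each depends on your setup and is not covered by this sketch.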
