With Apache Pig, how to select and store columns from a CSV according to the header line
Problem description
I have many CSV files, all with a header line. The files all look similar:
name, gender, preference, ....
peter, m, soap, ...
paul, m, gel, ...
mary, f, soap, ...
.
.
.
But the column positions and exact header names can differ slightly from file to file, e.g. another file could look like:
"the preferences", "the name", "the gender",....
soap, peter, m, ...
gel, paul, m, ...
soap, mary, f, ...
.
.
.
I want to output/store only the columns whose header contains the word "name". I do not know the position of this column in advance, because it can be different in each file.
So I need to associate the columns in each file with their header names. Can I do this in Pig?
I thought of using two FILTER operators (one for the header, one for the data), but wouldn't the data then have to be read twice?
Recommended answer
It would probably be easier to do this in streaming or in a storage function.
See the implementation of CSVExcelStorage and SKIP_INPUT_HEADER - http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
You could read the header of the file, find the position of the "name" field, and then return only the field at that position for all the other records in the file.
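For the streaming route, the header lookup described above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name `select_columns`, the case-insensitive substring match on "name", and the choice to emit data rows without the header are all assumptions. It reads one whole file from standard input, so it only works when the process actually sees that file's header line.

```python
import csv
import sys


def select_columns(lines, keyword="name"):
    """Yield each data row reduced to the columns whose header contains `keyword`.

    `select_columns` and `keyword` are hypothetical names used for illustration.
    """
    # skipinitialspace lets the reader handle 'a, "b c", d' style lines.
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to emit
    # Positions of every header cell that mentions the keyword (case-insensitive).
    keep = [i for i, cell in enumerate(header) if keyword.lower() in cell.lower()]
    if not keep:
        raise ValueError(f"no header cell contains {keyword!r}: {header}")
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Ignore positions that a short row does not have.
        yield [row[i].strip() for i in keep if i < len(row)]


if __name__ == "__main__":
    writer = csv.writer(sys.stdout, lineterminator="\n")
    writer.writerows(select_columns(sys.stdin))
```

A script like this (saved under a hypothetical name such as select_name.py) could then be plugged into a Pig script through the STREAM ... THROUGH operator. How Pig serializes the records it hands to the script depends on how the data was loaded, so the parsing may need adjusting.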
You should make sure that each split is a single file. If a file is split between tasks, the tasks that process the parts of the file without the header will not be able to detect the "name" field.