How can I incorporate the current input filename into my Pig Latin script?

Question

I am processing data from a set of files which contain a date stamp as part of the filename. The data within the file does not contain the date stamp. I would like to process the filename and add it to one of the data structures within the script. Is there a way to do that within Pig Latin (an extension to PigStorage maybe?) or do I need to preprocess all of the files using Perl or the like beforehand?

I envision something like the following:

-- Load two fields from file, then generate a third from the filename
rawdata = LOAD '/directory/of/files/' USING PigStorage AS (field1:chararray, field2:int, field3:filename);

-- Reformat the filename into a datestamp
annotated = FOREACH rawdata GENERATE
  REGEX_EXTRACT(field3, '.*-(20[0-9]{6})-.*', 1) AS datestamp,
  field1, field2;

Note the special "filename" datatype in the LOAD statement. Seems like it would have to happen there as once the data has been loaded it's too late to get back to the source filename.

Recommended answer

You can use PigStorage with the -tagsource option, as follows:

A = LOAD 'input' using PigStorage(',','-tagsource'); 
B = foreach A generate INPUT_FILE_NAME; 

The first field in each tuple will contain the input path (INPUT_FILE_NAME).

This is according to the API documentation: http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
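
To tie this back to the question, here is a minimal sketch of how -tagsource might be combined with REGEX_EXTRACT to turn the source name into a datestamp column. The directory path comes from the question; the tab delimiter, the positional references, and the assumption that file names contain a segment like -20120131- are illustrative guesses, not something the answer or the API documentation states. Depending on the Pig version, the first field may hold a bare file name or a longer path; the pattern below is written to match either.

```pig
-- Sketch only: delimiter, field positions, and file name pattern are assumptions.
-- With -tagsource, PigStorage prepends the input source to every tuple,
-- so $0 is the source name and the file's own columns start at $1.
rawdata = LOAD '/directory/of/files/' USING PigStorage('\t', '-tagsource');

-- Pull an 8-digit date starting with "20" out of the source name,
-- then cast the remaining untyped fields to the types the question wanted.
annotated = FOREACH rawdata GENERATE
  REGEX_EXTRACT((chararray)$0, '.*-(20[0-9]{6})-.*', 1) AS datestamp,
  (chararray)$1 AS field1,
  (int)$2 AS field2;
```

Positional references ($0, $1, $2) are used here so the sketch does not depend on how a declared schema interacts with the prepended field. REGEX_EXTRACT returns null for names that do not match the pattern, so it may be worth filtering on datestamp IS NOT NULL before further processing.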
