How can I incorporate the current input filename into my Pig Latin script?

Question

I am processing data from a set of files whose filenames contain a date stamp. The data within the files does not contain the date stamp. I would like to process the filename and add it to one of the data structures within the script. Is there a way to do that within Pig Latin (an extension to PigStorage, maybe?), or do I need to preprocess all of the files beforehand using Perl or the like?

Here is what I envision:

-- Load two fields from file, then generate a third from the filename
rawdata = LOAD '/directory/of/files/' USING PigStorage AS (field1:chararray, field2:int, field3:filename);

-- Reformat the filename into a datestamp
annotated = FOREACH rawdata GENERATE
  REGEX_EXTRACT(field3,'.*-(20\d{6})-.*',1) AS datestamp,
  field1, field2;

Note the special "filename" datatype in the LOAD statement. It seems this would have to happen there, because once the data has been loaded it is too late to get back to the source filename.

Recommended answer

You can use PigStorage with the -tagsource option, as follows:

A = LOAD 'input' using PigStorage(',','-tagsource'); 
B = foreach A generate INPUT_FILE_NAME; 

The first field in each tuple will contain the input path (INPUT_FILE_NAME).

Source: the API documentation at http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
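To connect this back to the question, here is a minimal sketch of how -tagsource might be combined with REGEX_EXTRACT to produce the datestamp field. It rests on several assumptions that are not confirmed by the answer: that the input is tab-delimited (the PigStorage default), that the tagged source field can be named first in the AS schema (the name `filename` is made up for illustration), and that the filenames contain the date between hyphens as in the question's pattern. The character class `[0-9]` is used instead of `\d` to sidestep backslash escaping in Pig string literals. Behavior may differ between Pig versions, so treat this as a starting point to test rather than a definitive implementation.

```pig
-- Load with -tagsource so the input source is prepended as the first field.
-- ASSUMPTION: the AS schema lists the tagged field first; 'filename' is a
-- hypothetical name, and field1/field2 are carried over from the question.
rawdata = LOAD '/directory/of/files/' USING PigStorage('\t', '-tagsource')
          AS (filename:chararray, field1:chararray, field2:int);

-- Extract the eight-digit date stamp (20YYMMDD style) from the source name.
-- ASSUMPTION: names look like something-20120131-something.
annotated = FOREACH rawdata GENERATE
    REGEX_EXTRACT(filename, '.*-(20[0-9]{6})-.*', 1) AS datestamp,
    field1, field2;
```

If the schema does not line up as assumed on your Pig version, referring to the tagged field by position ($0) after a schema-less LOAD is another option to try.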
