Add folder name to output in Pig Latin
Question
I have the following directory structure in HDFS:
logs_folder
|---2021-03-01
|   |---log1
|   |---log2
|   |---log3
|---2021-03-02
|   |---log1
|   |---log2
|---2021-03-03
|   |---log1
|   |---log2
...
The logs consist of text data. The data itself contains no date, because the date is already in the folder name. I want to read all the logs and save them in the following format:
date id
where id is a field from the log, and the date has to be taken from the folder name. Expected output:
2021-03-01 id1
2021-03-01 id2
...
2021-03-02 id234
2021-03-02 id456
...
How can I add the date from the folder name to the output?
I found a closely related question on how to add the full path name to the data while loading:
A = LOAD '/logs_folder/*' using PigStorage(',','-tagPath');
DUMP A ;
Related question: How can I incorporate the current input filename into my Pig Latin script?
That is very close, but how do I get only the parent folder name instead of the full path?
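Recommended answer
In the end I used this approach:
- Load the data with the `-tagPath` option. It adds a column to the loaded data that contains the full path of each file.
- Use a regular expression to extract only the parent folder name from that path.
Code example: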
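-- -tagPath prepends the full path of the source file as the first column.
hadoop_data = LOAD '/logs_folder/*' USING PigStorage(',', '-tagPath') AS (filepath:chararray, id:chararray, feature:chararray, value:chararray);
-- Capture the last directory component of the path, i.e. the date folder.
hadoop_data = FOREACH hadoop_data GENERATE id, (chararray)REGEX_EXTRACT(filepath, '.*\\/(.*)\\/', 1) AS path, feature, value;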
My data consists of 3 fields (id, feature, value), but as you can see the schema declares 4 of them: the filepath field was added!
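To sanity-check the regular expression without running a Pig job, the same pattern can be tried in plain Python. This is only a sketch under some assumptions: the helper name `parent_folder` and the sample paths (including the namenode host and port) are made up for illustration, and it assumes the pattern is applied as a search within the path rather than as a full-string match, which is consistent with the script in this answer working on paths that end in a file name.

```python
import re

# Same pattern as in the Pig script. Pig's '\\/' reaches the regex engine
# as '\/', which is just a literal '/'.
PARENT_DIR = re.compile(r".*/(.*)/")


def parent_folder(filepath):
    """Return the name of the directory directly containing the file, or None."""
    match = PARENT_DIR.search(filepath)
    return match.group(1) if match else None


# Hypothetical paths; the host and port are invented for this example.
print(parent_folder("hdfs://namenode:8020/logs_folder/2021-03-01/log1"))  # 2021-03-01
print(parent_folder("/logs_folder/2021-03-02/log2"))                      # 2021-03-02
print(parent_folder("log1"))                                              # None
```

The greedy leading `.*` consumes as much of the path as it can, so the capture group ends up holding only the text between the last two slashes, which is the date folder.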