Using GroupBy while copying from HDFS to S3 to merge files within a folder


Problem Description

I have the following folders in HDFS:

hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/AE/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/BH/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/IN/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/IN/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/KW/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/KW/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/ME/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/OM/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/Others/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/QA/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/QA/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/SA/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/SA/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/AE/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/AE/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/BH/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/BH/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/IN/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/IN/INT/20171001/2017100101

Each folder has close to 50 files in it. My intention is to merge all the files within a folder into a single file while copying from HDFS to S3. The issue I am having is with the regex for the groupBy option. I tried this, but it does not seem to work:

s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/  --groupBy '.*/(\w+)/(\w+)/(\w+)/.*' --outputCodec lzo

The command runs without error, but the files within each folder are not merged into a single file, which makes me believe the issue is with my regex.
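
To see why the grouping goes wrong, note that s3-dist-cp (as I understand its documented behavior) concatenates the regex's capture groups to form the name of each merged output file, so every source file whose groups concatenate to the same key lands in the same merged file. Below is a minimal sketch of that keying logic, using Python's regex engine as a stand-in for the Java engine s3-dist-cp actually uses, and a made-up part-file name 000000_0:

import re

# Hypothetical sample paths; the leaf file name "000000_0" is invented
# for illustration (any \w+ part-file name behaves the same way).
paths = [
    "hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101/000000_0",
    "hdfs://x.x.x.x:8020/Air/SEARCH/AE/DOM/20171001/2017100101/000000_0",
]

# The regex from the question: the leading greedy .* pushes the three
# capture groups as far right as possible in the path.
pattern = re.compile(r".*/(\w+)/(\w+)/(\w+)/.*")

for p in paths:
    groups = pattern.fullmatch(p).groups()
    print(groups, "->", "".join(groups))

# Both paths yield ('DOM', '20171001', '2017100101') -> DOM201710012017100101,
# so files from the BOOK and SEARCH trees fall into one group instead of
# one group per folder.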

Solution

I figured this out myself. The correct regex is:

.*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*

and the command to merge and copy is:

s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/  --groupBy '.*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*' --outputCodec lzo
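
The fix works because anchoring on /Air/ pins the three capture groups to the BOOK/SEARCH, country, and DOM/INT segments, so each leaf folder produces its own group key. A quick check of the corrected pattern, a sketch under the same assumptions as above (hypothetical file name, Python standing in for Java's regex engine):

import re

paths = [
    "hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101/000000_0",
    "hdfs://x.x.x.x:8020/Air/SEARCH/AE/DOM/20171001/2017100101/000000_0",
]

# Anchoring on /Air/ forces the groups onto the first three segments
# below it; the trailing .*/.*/.* covers the date, hour, and file name.
pattern = re.compile(r".*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*")

for p in paths:
    print("".join(pattern.fullmatch(p).groups()))

# Prints BOOKAEDOM and SEARCHAEDOM: one distinct key, and therefore one
# merged output file, per folder.

One side effect to be aware of, if I read the groupBy behavior correctly: the merged output's file name is derived from the concatenated groups (e.g. BOOKAEDOM) rather than from the original file names.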
