Using GroupBy while copying from HDFS to S3 to merge files within a folder
I have the following folders in HDFS:
hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/AE/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/BH/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/IN/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/IN/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/KW/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/KW/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/ME/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/OM/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/Others/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/QA/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/QA/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/SA/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/SA/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/AE/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/AE/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/BH/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/BH/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/IN/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/IN/INT/20171001/2017100101
Each folder has close to 50 files in it. My intention is to merge all the files within a folder into a single file while copying from HDFS to S3. The issue I am having is with the regex for the --groupBy option. I tried this, but it does not seem to work:
s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/ --groupBy '.*/(\w+)/(\w+)/(\w+)/.*' --outputCodec lzo
The command runs, but the files within each folder are not merged into a single file, which makes me believe the issue is with my regex.
I figured this out myself. The correct regex is:
.*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*
and the command to merge and copy is:
s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/ --groupBy '.*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*' --outputCodec lzo
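The difference between the two patterns can be checked locally before running the job. s3-dist-cp combines all files whose concatenated --groupBy capture groups are identical into one output file, so the grouping key is what matters. The sketch below (the trailing part-00000 file names are assumptions added for illustration) shows that the first pattern's greedy .* captures the date/hour segments rather than the BOOK/country/DOM segments, so it does not produce one key per folder, while the corrected pattern does:

```python
import re

# Hypothetical sample file paths modeled on the HDFS listing above;
# the "part-00000" leaf names are assumptions for illustration.
paths = [
    "hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101/part-00000",
    "hdfs://x.x.x.x:8020/Air/SEARCH/IN/DOM/20171001/2017100101/part-00000",
]

# s3-dist-cp merges all files whose concatenated capture groups are equal.
broken  = re.compile(r'.*/(\w+)/(\w+)/(\w+)/.*')
working = re.compile(r'.*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*')

for p in paths:
    # The greedy .* in the broken pattern backtracks only far enough to
    # satisfy the groups, so it captures the last three full segments.
    print("broken :", "".join(broken.match(p).groups()))
    # Anchoring on /Air/ pins the groups to type/country/scope instead.
    print("working:", "".join(working.match(p).groups()))
```

With the broken pattern, both paths yield the same key (DOM201710012017100101), so unrelated folders would collapse together rather than one merged file per folder; the working pattern yields distinct keys (BOOKAEDOM vs. SEARCHINDOM), one per folder prefix.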