Hadoop job taking input files from multiple directories

Views: 31 | Published: 2021/12/15 19:15:50 | Tags: file, input, hadoop

Problem description

I have multiple (100+) gz-compressed files, each 2-3 MB, spread across multiple directories. For example:
A1/B1/C1/part-0000.gz
A2/B2/C2/part-0000.gz
A1/B1/C1/part-0001.gz

I have to feed all these files into one Map job. From what I can see, MultipleFileInputFormat requires all input files to be in the same directory. Is it possible to pass multiple directories directly into the job?
If not, is it possible to efficiently put these files into one directory without naming conflicts, or to merge them into a single compressed gz file?
Note: I am implementing the Mapper in plain Java, not using Pig or Hadoop Streaming.

Any help regarding the above issue will be deeply appreciated.
Thanks,
Ankit
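
For the multi-directory case asked about above, one possible starting point is to derive the distinct input directories from the file list and join them with commas. The sketch below is plain Java with no Hadoop dependency; the class name `InputDirs` and the method `commaSeparatedDirs` are hypothetical names chosen for illustration, and it assumes `/`-separated paths that contain no commas.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Hadoop): collects the parent
// directories of a list of input files.
public class InputDirs {

    // Returns the distinct parent directories of the given paths,
    // in first-seen order, joined with commas.
    public static String commaSeparatedDirs(List<String> files) {
        Set<String> dirs = new LinkedHashSet<>();
        for (String f : files) {
            int slash = f.lastIndexOf('/');
            String dir;
            if (slash < 0) {
                dir = ".";              // no directory component
            } else if (slash == 0) {
                dir = "/";              // file directly under the root
            } else {
                dir = f.substring(0, slash);
            }
            dirs.add(dir);
        }
        return String.join(",", dirs);
    }

    public static void main(String[] args) {
        // Paths taken from the example in the question.
        List<String> files = Arrays.asList(
            "A1/B1/C1/part-0000.gz",
            "A2/B2/C2/part-0000.gz",
            "A1/B1/C1/part-0001.gz");
        System.out.println(commaSeparatedDirs(files)); // prints A1/B1/C1,A2/B2/C2
    }
}
```

In a job driver, a string of this shape could then be handed to `FileInputFormat.addInputPaths(job, dirs)` from `org.apache.hadoop.mapreduce.lib.input`, or each directory could be added on its own with `FileInputFormat.addInputPath`. Those calls are left out here so that the sketch runs without Hadoop on the classpath.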
