Hadoop job taking input files from multiple directories
Problem description
I have a situation where I have many files (100+, each 2-3 MB) in compressed gz format, present in multiple directories. For example:
A1/B1/C1/part-0000.gz
A2/B2/C2/part-0000.gz
A1/B1/C1/part-0001.gz
I have to feed all these files into one Map job. From what I see, when using MultipleFileInputFormat all input files need to be in the same directory. Is it possible to pass multiple directories directly into the job?
If not, is it possible to efficiently put these files into one directory without naming conflicts, or to merge them into one single compressed gz file?
Note: I am using plain java to implement the Mapper and not using Pig or hadoop streaming.
Any help regarding the above issue will be deeply appreciated.
Thanks,
Ankit
FileInputFormat.addInputPaths() takes the job plus a comma-separated list of paths, like
FileInputFormat.addInputPaths(job, "foo/file1.gz,bar/file2.gz")
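Since the comma-separated string is all addInputPaths needs, the directory list can be assembled with plain Java. A minimal sketch, assuming hypothetical directory names matching the question's layout (the real driver would also configure the Job and pass the joined string to FileInputFormat.addInputPaths):

```java
import java.util.Arrays;
import java.util.List;

public class MultiDirInput {
    // Build the comma-separated path string that
    // FileInputFormat.addInputPaths(job, paths) expects.
    static String joinPaths(List<String> dirs) {
        return String.join(",", dirs);
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList("A1/B1/C1", "A2/B2/C2");
        String paths = joinPaths(dirs);
        System.out.println(paths); // A1/B1/C1,A2/B2/C2
        // In the real driver, with the Hadoop jars on the classpath:
        // FileInputFormat.addInputPaths(job, paths);
    }
}
```

As far as I know, Hadoop input path strings may also contain glob patterns (e.g. `A*/B*/C*/part-*.gz`), which can be handier than enumerating every directory by hand.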
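On the merge option the question mentions: the gzip format (RFC 1952) allows multiple members to be simply concatenated, so the part files can be appended byte-for-byte into one .gz without recompressing. A sketch in plain Java, with hypothetical file names (whether a given decompressor reads all members of such a multi-member stream should be verified for your Hadoop version):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class GzMerge {
    // Concatenate several .gz files into one output file.
    // The result is a valid multi-member gzip stream, so no
    // decompression or recompression is needed.
    static void mergeGz(List<Path> parts, Path out) throws IOException {
        try (OutputStream os = Files.newOutputStream(out)) {
            for (Path p : parts) {
                Files.copy(p, os); // raw byte copy of one gzip member
            }
        }
    }
}
```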