In Hadoop, how can you give a whole file as input to the mapper?
Question

An interviewer recently asked me this question:
I said by configuring the block size or split size to be equal to the file size.

He said that was wrong.
Answer
Well, if you phrased it like that, I think he didn't like the "configuring block size" part.
EDIT: In any case, changing the block size is a bad idea, because it is global to HDFS.
A solution that does prevent splitting, on the other hand, is to set the minimum split size larger than the largest file you want to map.
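A minimal driver-side sketch of that idea, assuming the new `mapreduce` API (the job name and class name here are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "whole-file-job");
        // Raise the minimum split size above the length of any input
        // file; Long.MAX_VALUE guarantees one split per file.
        FileInputFormat.setMinInputSplitSize(job, Long.MAX_VALUE);
    }
}
```

This is equivalent to setting the `mapreduce.input.fileinputformat.split.minsize` property in the job configuration.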
A cleaner solution is to subclass the relevant InputFormat implementation and override its isSplitable() method to return false. In your case, you could do something like this with FileInputFormat (extending a concrete subclass such as TextInputFormat, since FileInputFormat itself is abstract):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NoSplitFileInputFormat extends TextInputFormat
{
    @Override
    protected boolean isSplitable(JobContext context, Path file)
    {
        // Never split: each input file becomes exactly one split,
        // so a single mapper sees the whole file.
        return false;
    }
}
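To use the custom format, you would register it in the job driver; a hedged one-line sketch (assuming a `Job` instance named `job` already exists):

```java
// Tell the job to use the non-splitting input format instead of the default.
job.setInputFormatClass(NoSplitFileInputFormat.class);
```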