Control the MultipleOutputFormat files sub-path
I need to control the sub-path of the different files managed by MultipleOutputFormat, based on the reducer key.
Basically, I want to set the sub-path of each file based on the key given to the reducer.
I can change the file name by overriding the generateFileNameForKeyValue method of MultipleOutputFormat, but how can I also change the sub-path of these files?
That is, by overriding only generateFileNameForKeyValue, I get
mySetJobConfigOutputPath/fileNameBasedKey1.dat
                        /fileNameBasedKey2.dat
                        /fileNameBasedKey3.dat
                        ...
but I want the files to be organized like this:
mySetJobConfigOutputPath/path0ConfiguredInsideReducerBasedOnKey/fileNameBasedKey1.dat
                        /path1ConfiguredInsideReducerBasedOnKey/fileNameBasedKey2.dat
                                                               /fileNameBasedKey3.dat
                        /path2ConfiguredInsideReducerBasedOnKey/fileNameBasedKey8.dat
As shown, both the sub-path and the file name are derived from the key inside the reducer.
I know how to configure the file name, but can I also configure the sub-path of each file under the mySetJobConfigOutputPath folder?
I found out that I can also override the getInputFileBasedOutputFileName method and add the sub-path there.
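@Override
protected String getInputFileBasedOutputFileName(JobConf conf, String name)
{
    // Your logic goes here: prepend the sub-path to the name and return it.
    // "subPath" is only a placeholder for whatever directory you compute.
    String subPath = "subPath";
    return subPath + "/" + name;
}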
You should still override generateFileNameForKeyValue to derive the leaf file name from the key.
UPDATE: This article explains it all: http://www.infoq.com/articles/HadoopOutputFormat
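To make the path-building step concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of how a sub-path and a file name could be derived from a single key. The class name KeyBasedPaths, the method relativePathFor, and the "subPath:fileName" key layout are all assumptions made for illustration; they are not part of Hadoop or of the question. As far as I know, the name returned from generateFileNameForKeyValue is resolved relative to the job output directory, so returning a string like this from that override (or prepending the sub-path in getInputFileBasedOutputFileName) should produce the nested layout.

```java
import java.util.Objects;

/**
 * Hypothetical helper: builds "subPath/fileName.dat" from a reducer key.
 * The "subPath:fileName" key layout is an assumption for this sketch only.
 */
public final class KeyBasedPaths {

    private KeyBasedPaths() {}

    /** Returns a path relative to the job output directory. */
    public static String relativePathFor(String key) {
        Objects.requireNonNull(key, "key");
        int sep = key.indexOf(':');
        if (sep <= 0 || sep == key.length() - 1) {
            // No usable sub-path in the key: keep the file at the top level.
            return sanitize(key) + ".dat";
        }
        String subPath = sanitize(key.substring(0, sep));
        String fileName = sanitize(key.substring(sep + 1));
        return subPath + "/" + fileName + ".dat";
    }

    /** Replaces characters that would otherwise act as separators. */
    private static String sanitize(String part) {
        return part.replace('/', '_').replace('\\', '_').replace(':', '_');
    }

    public static void main(String[] args) {
        System.out.println(relativePathFor("path0:fileNameBasedKey1"));
        System.out.println(relativePathFor("fileNameBasedKey3"));
    }
}
```

Inside a MultipleOutputFormat subclass, the override of generateFileNameForKeyValue could then simply return KeyBasedPaths.relativePathFor(key.toString()). Keeping the string logic in a separate helper makes it easy to unit-test without a cluster.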