Renaming Part Files in Hadoop Map Reduce
Question
I have tried to use the MultipleOutputs class, as per the example on the page
http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
Driver Code
Configuration conf = new Configuration();
Job job = new Job(conf, "Wordcount");
job.setJarByClass(WordCount.class);
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
Text.class, IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
Reducer Code
public class WordCountReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    public void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        // context.write(key, result);
        mos.write("text", key, result);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // MultipleOutputs must be closed, otherwise the named-output files
        // may be left incomplete.
        mos.close();
    }
}
The output of the reducer is indeed renamed to text-r-00000.
But the issue here is that I am also getting an empty part-r-00000 file. Is this how MultipleOutputs is expected to behave, or is there some problem with my code? Please advise.
Another alternative I have tried is to iterate through my output folder using the FileSystem class and manually rename all files beginning with part. What is the best approach?
FileSystem hdfs = FileSystem.get(configuration);
FileStatus[] fs = hdfs.listStatus(new Path(outputPath));
for (FileStatus aFile : fs) {
    if (aFile.isDir()) {
        // delete all directories and sub-directories (if any) in the output directory
        hdfs.delete(aFile.getPath(), true);
    } else {
        if (aFile.getPath().getName().contains("_")) {
            // delete all log files and the _SUCCESS file in the output directory
            hdfs.delete(aFile.getPath(), true);
        } else {
            hdfs.rename(aFile.getPath(), new Path(myCustomName));
        }
    }
}
Answer
Even if you are using MultipleOutputs, the default OutputFormat (I believe it is TextOutputFormat) is still being used, so it will initialize and create these part-r-xxxxx files that you are seeing.
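For illustration, the job output directory then ends up looking something like this (file names taken from the question; the _SUCCESS marker is also created on success):

part-r-00000   <- empty file created by the default TextOutputFormat
text-r-00000   <- the data written through MultipleOutputs
_SUCCESS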
The fact that they are empty is because you are not doing any context.write, since you are using MultipleOutputs instead. But that doesn't prevent them from being created during initialization.
To get rid of them, you need to define your OutputFormat to say you are not expecting any output. You can do it this way:
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
job.setOutputFormatClass(NullOutputFormat.class);
With that property set, this should ensure that your part files are never initialized at all, but you still get your output in the MultipleOutputs.
You could also probably use LazyOutputFormat, which would ensure that your output files are only created when/if there is some data, rather than initializing empty files. You could do it this way:
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
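LazyOutputFormat here merely wraps the real output format (TextOutputFormat in this case) and delays creating each output file until the first record is actually written, which is why no empty part files appear.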
Note that you are using in your Reducer the prototype MultipleOutputs.write(String namedOutput, K key, V value), which just uses a default output path generated from your namedOutput, something like: {namedOutput}-(m|r)-{part-number}. If you want to have more control over your output filenames, you should use the prototype MultipleOutputs.write(String namedOutput, K key, V value, String baseOutputPath), which allows you to generate filenames at runtime based on your keys/values.
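For example, a minimal sketch of the four-argument form inside the reduce method above (the "counts/" prefix is just an illustration; Hadoop appends the -r-xxxxx suffix to whatever base path you give it):

// Write each key's total to a file whose name is derived from the key itself,
// e.g. counts/hello-r-00000. The "counts/" subdirectory is hypothetical.
mos.write("text", key, result, "counts/" + key.toString());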