Renaming Part Files in Hadoop Map Reduce


Problem Description

I have tried to use the MultipleOutputs class as per the example on the page http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

Driver code

    Configuration conf = new Configuration();
    Job job = new Job(conf, "Wordcount");
    job.setJarByClass(WordCount.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
            Text.class, IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);

Reducer code

public class WordCountReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    private MultipleOutputs<Text, IntWritable> mos;
    public void setup(Context context){
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        // Write to the named output "text" instead of context.write(key, result),
        // so records go to text-r-xxxxx rather than part-r-xxxxx.
        mos.write("text", key, result);
    }
    public void cleanup(Context context) {
        try {
            // Close MultipleOutputs to flush and close its record writers.
            mos.close();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

The output of the reducer is indeed renamed to text-r-00000.

But the issue here is that I am also getting an empty part-r-00000 file. Is this how MultipleOutputs is expected to behave, or is there some problem with my code? Please advise.

Another alternative I have tried is to iterate through my output folder using the FileSystem class and manually rename all files beginning with part.

Which is the best approach?

FileSystem hdfs = FileSystem.get(configuration);
FileStatus fs[] = hdfs.listStatus(new Path(outputPath));
for (FileStatus aFile : fs) {
    if (aFile.isDir()) {
        // delete all directories and sub-directories (if any) in the output directory
        hdfs.delete(aFile.getPath(), true);
    } else {
        if (aFile.getPath().getName().contains("_")) {
            // delete all log files and the _SUCCESS file in the output directory
            hdfs.delete(aFile.getPath(), true);
        } else {
            hdfs.rename(aFile.getPath(), new Path(myCustomName));
        }
    }
}
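
One caveat with this approach: if the job produces more than one part file, each rename needs a distinct target, or the calls will collide on the same path. Below is a minimal sketch of a collision-free version, where the wordcount- naming scheme is a hypothetical stand-in for the unspecified myCustomName:

// Run this only after job.waitForCompletion(true) has returned successfully,
// so that the output directory is complete.
FileSystem hdfs = FileSystem.get(configuration);
int i = 0;
for (FileStatus aFile : hdfs.listStatus(new Path(outputPath))) {
    String name = aFile.getPath().getName();
    if (!aFile.isDir() && name.startsWith("part")) {
        // Hypothetical naming scheme: wordcount-0, wordcount-1, ...
        hdfs.rename(aFile.getPath(), new Path(outputPath, "wordcount-" + i++));
    }
}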

Recommended Answer

Even if you are using MultipleOutputs, the default OutputFormat (I believe it is TextOutputFormat) is still being used, so it will still be initialized and create the part-r-xxxxx files that you are seeing.

The fact that they are empty is because you are not doing any context.write, since you are using MultipleOutputs. But that doesn't prevent them from being created during initialization.

To get rid of them, you need to define your OutputFormat to say you are not expecting any output. You can do it this way:

import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
job.setOutputFormatClass(NullOutputFormat.class);

With that set, your part files should never be initialized at all, but you still get your output in the MultipleOutputs.
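
As a minimal sketch (assuming the same WordCount driver shown above), only the OutputFormat line changes; the named output is registered exactly as before:

// Suppress the default part-r-xxxxx output entirely.
job.setOutputFormatClass(NullOutputFormat.class);
// The named output "text" is still written through TextOutputFormat.
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
        Text.class, IntWritable.class);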

You could also use LazyOutputFormat, which ensures that your output files are only created when/if there is some data, rather than initializing empty files. You can do it this way:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat; 
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
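
Note that this call replaces any plain job.setOutputFormatClass(...) in the driver: LazyOutputFormat becomes the job's OutputFormat and wraps TextOutputFormat, deferring creation of the underlying RecordWriter until the first record is actually written, which is why empty part files never appear.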

Note that in your Reducer you are using the prototype MultipleOutputs.write(String namedOutput, K key, V value), which just uses a default output path generated from your namedOutput, producing file names like {namedOutput}-(m|r)-{part-number}. If you want more control over your output file names, you should use the prototype MultipleOutputs.write(String namedOutput, K key, V value, String baseOutputPath), which lets you generate file names at runtime based on your keys/values.
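
For example, a hypothetical variation of the reduce() above (the "counts/" base path scheme is an illustrative assumption, not from the original question):

// Writes to {output dir}/counts/{key}-r-00000 instead of text-r-00000;
// the framework still appends the -r-{part-number} suffix to baseOutputPath.
mos.write("text", key, result, "counts/" + key.toString());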

