Renaming Part Files in Hadoop Map Reduce


Problem Description

I have tried to use the MultipleOutputs class as per the example in page http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

Driver Code

    Configuration conf = new Configuration();
    Job job = new Job(conf, "Wordcount");
    job.setJarByClass(WordCount.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
            Text.class, IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);

Reducer Code

public class WordCountReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    private MultipleOutputs<Text, IntWritable> mos;
    public void setup(Context context){
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        //context.write(key, result);
        mos.write("text", key,result);
    }
    public void cleanup(Context context) {
        try {
            mos.close();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

The output of the reducer does get written as text-r-00000.

But the issue is that I am also getting an empty part-r-00000 file. Is this how MultipleOutputs is expected to behave, or is there a problem with my code? Please advise.

Another alternative I have tried out is to iterate through my output folder using the FileSystem class and manually rename all files beginning with part.

What is the best way?

FileSystem hdfs = FileSystem.get(configuration);
FileStatus fs[] = hdfs.listStatus(new Path(outputPath));
for (FileStatus aFile : fs) {
    if (aFile.isDir()) {
        // delete all directories and sub-directories (if any) in the output directory
        hdfs.delete(aFile.getPath(), true);
    } else {
        if (aFile.getPath().getName().contains("_")) {
            // delete all log files and the _SUCCESS file in the output directory
            hdfs.delete(aFile.getPath(), true);
        } else {
            hdfs.rename(aFile.getPath(), new Path(myCustomName));
        }
    }
}

Solution

Even if you are using MultipleOutputs, the default OutputFormat (I believe it is TextOutputFormat) is still being used, so it will still be initialized and create the part-r-xxxxx files that you are seeing.

They are empty because you are not doing any context.write, since you are writing through MultipleOutputs instead. But that doesn't prevent them from being created during initialization.

To get rid of them, you need to define your OutputFormat to say you are not expecting any output. You can do it this way:

job.setOutputFormatClass(NullOutputFormat.class);

With that set, the part files should never be initialized at all, while you still get your output through MultipleOutputs.
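
For illustration, here is a minimal sketch of how the driver from the question could be adjusted along these lines (assuming the new org.apache.hadoop.mapreduce API, where the setter is setOutputFormatClass, and NullOutputFormat from org.apache.hadoop.mapreduce.lib.output):

import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

// Keep the named output so MultipleOutputs knows its format and key/value types ...
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
        Text.class, IntWritable.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// ... but make the job's own output a no-op, so no empty part-r-xxxxx files
// are created. Per the answer above, records written via mos.write(...) in the
// reducer are still produced under the job's output directory.
job.setOutputFormatClass(NullOutputFormat.class);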

You could also probably use LazyOutputFormat, which ensures that your output files are only created when/if there is some data, instead of initializing empty files. You could do it this way:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat; 
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

Note that in your Reducer you are using the prototype MultipleOutputs.write(String namedOutput, K key, V value), which just uses a default output path generated from your namedOutput, something like: {namedOutput}-(m|r)-{part-number}. If you want more control over your output filenames, you should use the prototype MultipleOutputs.write(String namedOutput, K key, V value, String baseOutputPath), which lets you generate filenames at runtime based on your keys/values.
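
As a rough sketch of that second prototype (illustrative only; deriving the base path from the key is just an example, not something from the question), the write call in the reducer could become:

// Inside reduce(): the fourth argument is a baseOutputPath, resolved relative to
// the job's output directory. Using the key itself means each distinct key gets
// its own output file, e.g. "hadoop-r-00000". Note this creates one file per
// distinct base path.
mos.write("text", key, result, key.toString());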

