Writing to multiple folders in Hadoop?


Problem description

I am trying to separate my reducer output into different folders.

My driver has the following code:
FileOutputFormat.setOutputPath(job, new Path(output));
// MultipleOutputs.addNamedOutput(job, namedOutput, outputFormatClass, keyClass, valueClass)
MultipleOutputs.addNamedOutput(job, "foo", TextOutputFormat.class, NullWritable.class, Text.class);
MultipleOutputs.addNamedOutput(job, "bar", TextOutputFormat.class, Text.class, NullWritable.class);
MultipleOutputs.addNamedOutput(job, "foobar", TextOutputFormat.class, Text.class, NullWritable.class);

And then my reducer has the following code:
mos.write("foo", NullWritable.get(), new Text(jsn.toString()));
mos.write("bar", key, NullWritable.get());
mos.write("foobar", key, NullWritable.get());

But in the output, I see:

output/foo-r-0001
output/foo-r-0002
output/foobar-r-0001
output/bar-r-0001


But what I want is:

output/foo/part-r-0001
output/foo/part-r-0002
output/bar/part-r-0001
output/foobar/part-r-0001

How do I do this? Thanks

Solution

If you mean the MultipleOutputs class, the simplest way is to do one of the following from your reducer:

  1. Use a named output together with a base output path: write(String namedOutput, K key, V value, String baseOutputPath).
  2. Use only a base output path, without a named output: write(KEYOUT key, VALUEOUT value, String baseOutputPath).

In your case it is point 1, so change the following:

mos.write("foo", NullWritable.get(), new Text(jsn.toString()));
mos.write("bar", key, NullWritable.get());
mos.write("foobar", key, NullWritable.get());

to:

mos.write("foo", NullWritable.get(), new Text(jsn.toString()), "foo/part");
mos.write("bar", key, NullWritable.get(), "bar/part");
mos.write("foobar", key, NullWritable.get(), "foobar/part");

Here "foo/part", "bar/part" and "foobar/part" are the baseOutputPath values. The directories foo, bar and foobar are therefore created under the job output directory, and each one contains its own part-r-xxxxx files.
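To see why the extra argument produces folders, it may help to model how the file name is put together. The following is a plain-Java stand-in, not Hadoop code: the class BaseOutputPathDemo and its resolve helper are hypothetical, and the sketch assumes the usual reducer naming of the base path followed by "-r-" and a five-digit task number.

```java
import java.util.Locale;

// Stand-in model only: it mimics how a reduce task names its files.
// It does not call Hadoop, and the class and method names are hypothetical.
public class BaseOutputPathDemo {

    // Builds the file path a reduce task would write for a given baseOutputPath,
    // assuming the "<base>-r-<5-digit task number>" naming convention.
    public static String resolve(String outputDir, String baseOutputPath, int partition) {
        return String.format(Locale.ROOT, "%s/%s-r-%05d", outputDir, baseOutputPath, partition);
    }

    public static void main(String[] args) {
        int partition = 1; // hypothetical reduce task number

        // Without a baseOutputPath, the named output itself acts as the file prefix.
        System.out.println(resolve("output", "foo", partition));

        // With "foo/part", everything before the slash becomes a directory.
        for (String base : new String[] {"foo/part", "bar/part", "foobar/part"}) {
            System.out.println(resolve("output", base, partition));
        }
    }
}
```

Under these assumptions the first line printed has the flat form seen in the question (output/foo-r-00001), while the other three land in the foo, bar and foobar subdirectories.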

You might also try point 2 above, which does not need any named outputs.
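A rough sketch of what point 2 could look like in a reducer is shown below. It assumes the job is configured with TextOutputFormat, Text as the output key class and NullWritable as the output value class, because this form of write uses the job's own output types. The class name FolderRoutingReducer and the choice of which record goes to which folder are hypothetical, and the Hadoop MapReduce libraries must be on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch of point 2: no addNamedOutput calls are required in the driver.
// The class name and the routing of records are illustrative assumptions.
public class FolderRoutingReducer extends Reducer<Text, Text, Text, NullWritable> {

    private MultipleOutputs<Text, NullWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, NullWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Written with the job's output format and key/value types.
            mos.write(value, NullWritable.get(), "foo/part");
        }
        mos.write(key, NullWritable.get(), "bar/part");
        mos.write(key, NullWritable.get(), "foobar/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closes every file that was opened through mos.
        mos.close();
    }
}
```

Because all three folders share one key and value type here, the records originally written as (NullWritable, Text) are emitted as (Text, NullWritable) instead; with TextOutputFormat both forms should produce the same single-column lines.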

Please get back to me for further clarification, if needed.
