Hadoop HDFS MapReduce output into MongoDb
Question
I want to write a Java program which reads input from HDFS, processes it using MapReduce, and writes the output into MongoDb.
Here is the scenario:
- I have a Hadoop cluster with 3 datanodes.
- A Java program reads the input from HDFS and processes it using MapReduce.
- Finally, it writes the result into MongoDb.
Actually, reading from HDFS and processing it with MapReduce are simple, but I am stuck on writing the result into MongoDb. Is there a Java API that supports writing the result into MongoDB? Another question: since it is a Hadoop cluster, we don't know which datanode will run the Reducer task and generate the result. Is it possible to write the result into a MongoDb instance installed on a specific server?
If I want to write the result into HDFS, the code will be like this:
@Override
public void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException
{
    long sum = 0;
    for (LongWritable value : values)
    {
        sum += value.get();
    }
    context.write(new Text(key), new LongWritable(sum));
}
Now I want to write the result into MongoDb instead of HDFS. How can I do that?
You want the «MongoDB Connector for Hadoop»; its repository ships with examples.
It's tempting to just add code in your Reducer that, as a side effect, inserts data into your database. Avoid this temptation. One reason to use a connector as opposed to just inserting data as a side effect of your reducer class is speculative execution: Hadoop can sometimes run two of the exact same reduce tasks in parallel, which can lead to extraneous inserts and duplicate data.
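To make the connector approach concrete, here is a minimal job-driver sketch. It assumes the mongo-hadoop connector (the `mongo-hadoop-core` artifact) is on the classpath; `MongoConfigUtil` and `MongoOutputFormat` are the connector's classes, while the MongoDB host, the `testdb.wordcount` database/collection, and the `MyMapper`/`MyReducer` class names are placeholders for your own job. The reducer from the question can stay as-is: only the output side of the job changes, and per the connector's examples the reduce output key becomes the document's `_id` in the target collection.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class HdfsToMongoJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Point the connector at the target collection. The MongoDB server
        // does NOT have to live on a datanode: whichever node runs the
        // reducer writes over the network, so it only needs this URI to be
        // reachable from the cluster.
        MongoConfigUtil.setOutputURI(conf,
                "mongodb://mongo-host:27017/testdb.wordcount");

        // Extra safety against the duplicate-insert problem described
        // above: disable speculative execution for reduce tasks.
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "hdfs-to-mongo");
        job.setJarByClass(HdfsToMongoJob.class);

        job.setMapperClass(MyMapper.class);    // placeholder: your mapper
        job.setReducerClass(MyReducer.class);  // placeholder: the reducer shown above

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Input still comes from HDFS; only the output format changes.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setOutputFormatClass(MongoOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This is a sketch of job wiring, not a drop-in program: it needs a running cluster, a reachable MongoDB instance, and your mapper/reducer classes before it will do anything useful.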