Is it possible to read MongoDB data, process it with Hadoop, and output it into a RDBS (MySQL)?


Problem description


Summary:

Is it possible to:

  1. Import data into Hadoop with the «MongoDB Connector for Hadoop».
  2. Process it with Hadoop MapReduce.
  3. Export it with Sqoop in a single transaction.

I am building a web application with MongoDB. While MongoDB works well for most of the work, in some parts I need stronger transactional guarantees, for which I use a MySQL database.

My problem is that I want to read a big MongoDB collection for data analysis, but the size of the collection means that the analytic job would take too long to process. Unfortunately, MongoDB's built-in map-reduce framework would not work well for this job, so I would prefer to carry out the analysis with Apache Hadoop.

I understand that it is possible to read data from MongoDB into Hadoop by using the «MongoDB Connector for Hadoop», which reads data from MongoDB, processes it with MapReduce in Hadoop, and finally outputs the results back into a MongoDB database.
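
From what I can tell from the connector's examples, the input side boils down to pointing the job at a collection URI (via the connector's MongoConfigUtil helper, as far as I understand) and selecting MongoInputFormat; a rough sketch, where the URI, database, and collection names are placeholders of mine:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.util.MongoConfigUtil;

    // ... in the driver's job-setup code:
    Configuration conf = new Configuration();
    // Tell the connector which MongoDB collection to read (placeholder URI).
    MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/myapp.events");

    Job job = Job.getInstance(conf, "mongo-analysis");
    // The connector's input format splits the collection into Hadoop input splits.
    job.setInputFormatClass(MongoInputFormat.class);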

The problem is that I want the output of the MapReduce to go into a MySQL database, rather than MongoDB, because the results must be merged with other MySQL tables.

For this purpose I know that Sqoop can export the result of a Hadoop MapReduce job into MySQL.

Ultimately, I want to read MongoDB data, then process it with Hadoop, and finally output the result into a MySQL database.

Is this possible? Which tools are available to do this?

Solution

TL;DR: Set an output formatter that writes to an RDBS in your Hadoop job:

 job.setOutputFormatClass( DBOutputFormat.class );
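
Note that DBOutputFormat by itself does not know which database or table to write to: the job configuration also has to carry the JDBC connection details and the destination table. A minimal sketch of that wiring, assuming a hypothetical MySQL database analytics with a word_counts(word, count) table (driver URL, credentials, table, and column names are all placeholders, not from the original answer), using the new-API classes from org.apache.hadoop.mapreduce.lib.db:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

    // ... in the driver's job-setup code:
    Configuration conf = new Configuration();

    // Register the JDBC driver, connection URL and credentials (placeholders).
    DBConfiguration.configureDB(conf,
            "com.mysql.jdbc.Driver",
            "jdbc:mysql://db.example.com:3306/analytics",
            "hadoop_user",
            "secret");

    Job job = Job.getInstance(conf, "mongo-to-mysql");

    // Name the destination table and its columns; the job's output key fills
    // these columns when DBOutputFormat builds the INSERT statement.
    DBOutputFormat.setOutput(job, "word_counts", "word", "count");
    job.setOutputFormatClass(DBOutputFormat.class);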

Several things to note:

  1. Exporting data from MongoDB to Hadoop using Sqoop is not possible. This is because Sqoop uses JDBC, which provides a call-level API for SQL-based databases, but MongoDB is not an SQL-based database. You can look at the «MongoDB Connector for Hadoop» to do this job. The connector is available on GitHub. (Edit: as you point out in your update.)

  2. Sqoop exports are not made in a single transaction by default. Instead, according to the Sqoop docs:

    Since Sqoop breaks down export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option which acts as an auxiliary table that is used to stage exported data. The staged data is finally moved to the destination table in a single transaction.

  3. The «MongoDB Connector for Hadoop» does not seem to force the workflow you describe. According to the docs:

    This connectivity takes the form of allowing both reading MongoDB data into Hadoop (for use in MapReduce jobs as well as other components of the Hadoop ecosystem), as well as writing the results of Hadoop jobs out to MongoDB.

  4. Indeed, as far as I understand from the «MongoDB Connector for Hadoop» examples, it would be possible to specify an org.apache.hadoop.mapred.lib.db.DBOutputFormat in your Hadoop MapReduce job to write the output to a MySQL database. Following the example from the connector repository (a sketch of the key class such an output format expects follows the snippet):

    job.setMapperClass( TokenizerMapper.class );
    job.setCombinerClass( IntSumReducer.class );
    job.setReducerClass( IntSumReducer.class );
    job.setOutputKeyClass( Text.class );
    job.setOutputValueClass( IntWritable.class );
    job.setInputFormatClass( MongoInputFormat.class );
    /* Instead of:
     * job.setOutputFormatClass( MongoOutputFormat.class );
     * we use an OutputFormatClass that writes the job results 
     * to a MySQL database. Beware that the following OutputFormat 
     * will only write the *key* to the database, but the principle
     * remains the same for all output formatters
     */
    job.setOutputFormatClass( DBOutputFormat.class );
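
As the comment in the snippet above points out, DBOutputFormat persists only the job's output *key*, so that key class has to implement Hadoop's DBWritable (alongside the usual Writable) in order to bind its fields into the generated INSERT statement. A minimal sketch of such a key, reusing the hypothetical word_counts(word, count) table from the configuration sketch above; the class name and columns are illustrative only, not part of the connector's examples:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    // Emitted by the reducer as its output key; DBOutputFormat calls
    // write(PreparedStatement) once per record to bind one row of the INSERT.
    public class WordCountRecord implements Writable, DBWritable {
        private String word;
        private int count;

        public WordCountRecord() { }                    // Hadoop needs a no-arg constructor

        public WordCountRecord(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public void write(PreparedStatement statement) throws SQLException {
            statement.setString(1, word);               // "word" column
            statement.setInt(2, count);                 // "count" column
        }

        @Override
        public void readFields(ResultSet resultSet) throws SQLException {
            word = resultSet.getString(1);
            count = resultSet.getInt(2);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);
            out.writeInt(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            word = in.readUTF();
            count = in.readInt();
        }
    }

The reducer would then emit a WordCountRecord as its key (with, for example, NullWritable as the value, since DBOutputFormat ignores the value), and job.setOutputKeyClass would point at it instead of Text.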
    
