Is it possible to read MongoDB data, process it with Hadoop, and output it into a RDBS (MySQL)?


Problem description


Summary:

Is it possible to:

  1. Import data into Hadoop with the «MongoDB Connector for Hadoop».
  2. Process it with Hadoop MapReduce.
  3. Export it with Sqoop in a single transaction.

I am building a web application with MongoDB. While MongoDB works well for most of the work, in some parts I need stronger transactional guarantees, for which I use a MySQL database.

My problem is that I want to read a big MongoDB collection for data analysis, but the size of the collection means that the analytic job would take too long to process. Unfortunately, MongoDB's built-in map-reduce framework would not work well for this job, so I would prefer to carry out the analysis with Apache Hadoop.

I understand that it is possible to read data from MongoDB into Hadoop by using the «MongoDB Connector for Hadoop», which reads data from MongoDB, processes it with MapReduce in Hadoop, and finally outputs the results back into a MongoDB database.
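
From what I can tell from the connector's examples, the input side boils down to pointing the job at a collection URI (via the connector's MongoConfigUtil helper, as far as I understand) and selecting MongoInputFormat; a rough sketch, where the URI, database, and collection names are placeholders of mine:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.util.MongoConfigUtil;

    // ... in the driver's job-setup code:
    Configuration conf = new Configuration();
    // Tell the connector which MongoDB collection to read (placeholder URI).
    MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/myapp.events");

    Job job = Job.getInstance(conf, "mongo-analysis");
    // The connector's input format splits the collection into Hadoop input splits.
    job.setInputFormatClass(MongoInputFormat.class);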

The problem is that I want the output of the MapReduce to go into a MySQL database, rather than MongoDB, because the results must be merged with other MySQL tables.

For this purpose I know that Sqoop can export the result of a Hadoop MapReduce job into MySQL.

Ultimately, I want to read MongoDB data, then process it with Hadoop, and finally output the result into a MySQL database.

Is this possible? Which tools are available to do this?

Solution

TL;DR: Set an output formatter that writes to an RDBS in your Hadoop job:

 job.setOutputFormatClass( DBOutputFormat.class );
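
Note that DBOutputFormat by itself does not know which database or table to write to: the job configuration also has to carry the JDBC connection details and the destination table. A minimal sketch of that wiring, assuming a hypothetical MySQL database analytics with a word_counts(word, count) table (driver URL, credentials, table, and column names are all placeholders, not from the original answer), using the new-API classes from org.apache.hadoop.mapreduce.lib.db:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

    // ... in the driver's job-setup code:
    Configuration conf = new Configuration();

    // Register the JDBC driver, connection URL and credentials (placeholders).
    DBConfiguration.configureDB(conf,
            "com.mysql.jdbc.Driver",
            "jdbc:mysql://db.example.com:3306/analytics",
            "hadoop_user",
            "secret");

    Job job = Job.getInstance(conf, "mongo-to-mysql");

    // Name the destination table and its columns; the job's output key fills
    // these columns when DBOutputFormat builds the INSERT statement.
    DBOutputFormat.setOutput(job, "word_counts", "word", "count");
    job.setOutputFormatClass(DBOutputFormat.class);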

Several things to note:

  1. Exporting data from MongoDB to Hadoop using Sqoop is not possible. This is because Sqoop uses JDBC, which provides a call-level API for SQL-based databases, but MongoDB is not an SQL-based database. You can look at the «MongoDB Connector for Hadoop» to do this job. The connector is available on GitHub. (Edit: as you point out in your update.)

  2. Sqoop exports are not made in a single transaction by default. Instead, according to the Sqoop docs:

    Since Sqoop breaks down export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option which acts as an auxiliary table that is used to stage exported data. The staged data is finally moved to the destination table in a single transaction.

  3. The «MongoDB Connector for Hadoop» does not seem to force the workflow you describe. According to the docs:

    This connectivity takes the form of allowing both reading MongoDB data into Hadoop (for use in MapReduce jobs as well as other components of the Hadoop ecosystem), as well as writing the results of Hadoop jobs out to MongoDB.

  4. Indeed, as far as I understand from the «MongoDB Connector for Hadoop» examples, it would be possible to specify an org.apache.hadoop.mapred.lib.db.DBOutputFormat in your Hadoop MapReduce job to write the output to a MySQL database. Following the example from the connector repository (a sketch of the key class such an output format expects follows the snippet):

    job.setMapperClass( TokenizerMapper.class );
    job.setCombinerClass( IntSumReducer.class );
    job.setReducerClass( IntSumReducer.class );
    job.setOutputKeyClass( Text.class );
    job.setOutputValueClass( IntWritable.class );
    job.setInputFormatClass( MongoInputFormat.class );
    /* Instead of:
     * job.setOutputFormatClass( MongoOutputFormat.class );
     * we use an OutputFormatClass that writes the job results 
     * to a MySQL database. Beware that the following OutputFormat 
     * will only write the *key* to the database, but the principle
     * remains the same for all output formatters
     */
    job.setOutputFormatClass( DBOutputFormat.class );
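
As the comment in the snippet above points out, DBOutputFormat persists only the job's output *key*, so that key class has to implement Hadoop's DBWritable (alongside the usual Writable) in order to bind its fields into the generated INSERT statement. A minimal sketch of such a key, reusing the hypothetical word_counts(word, count) table from the configuration sketch above; the class name and columns are illustrative only, not part of the connector's examples:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    // Emitted by the reducer as its output key; DBOutputFormat calls
    // write(PreparedStatement) once per record to bind one row of the INSERT.
    public class WordCountRecord implements Writable, DBWritable {
        private String word;
        private int count;

        public WordCountRecord() { }                    // Hadoop needs a no-arg constructor

        public WordCountRecord(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public void write(PreparedStatement statement) throws SQLException {
            statement.setString(1, word);               // "word" column
            statement.setInt(2, count);                 // "count" column
        }

        @Override
        public void readFields(ResultSet resultSet) throws SQLException {
            word = resultSet.getString(1);
            count = resultSet.getInt(2);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);
            out.writeInt(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            word = in.readUTF();
            count = in.readInt();
        }
    }

The reducer would then emit a WordCountRecord as its key (with, for example, NullWritable as the value, since DBOutputFormat ignores the value), and job.setOutputKeyClass would point at it instead of Text.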
    
