Cannot connect to remote MongoDB from EMR cluster with spark-shell


Question

I'm trying to connect to a remote Mongo database from an EMR cluster. The following code is executed with the command spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2:

import com.stratio.datasource.mongodb._
import com.stratio.datasource.mongodb.config._
import com.stratio.datasource.mongodb.config.MongodbConfig._

// Build the read configuration for the Stratio spark-mongodb connector.
val builder = MongodbConfigBuilder(Map(Host -> List("[IP.OF.REMOTE.HOST]:3001"), Database -> "meteor", Collection -> "my_target_collection", ("user", "user_name"), ("database", "meteor"), ("password", "my_password")))
val readConfig = builder.build()
// Read the remote collection through the SQLContext implicit.
val mongoRDD = sqlContext.fromMongoDB(readConfig)
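
For completeness, once the connection works I would expect to consume the returned DataFrame roughly like this (a sketch only; the table name and query are my assumptions):

// Hypothetical follow-up once fromMongoDB succeeds (Spark 1.x DataFrame API):
mongoRDD.registerTempTable("my_target_collection") // expose it to Spark SQL
sqlContext.sql("SELECT * FROM my_target_collection LIMIT 5").show()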

Spark-shell responds with the following error:

16/07/26 15:44:35 INFO SparkContext: Starting job: aggregate at MongodbSchema.scala:47
16/07/26 15:44:45 WARN DAGScheduler: Creating new stage failed due to exception - job: 1
com.mongodb.MongoTimeoutException: Timed out after 10000 ms while waiting to connect. Client view of cluster state is {type=Unknown, servers=[{address=[IP.OF.REMOTE.HOST]:3001, type=Unknown, state=Connecting, exception={java.lang.IllegalArgumentException: response too long: 1347703880}}]
    at com.mongodb.BaseCluster.getDescription(BaseCluster.java:128)
    at com.mongodb.DBTCPConnector.getClusterDescription(DBTCPConnector.java:394)
    at com.mongodb.DBTCPConnector.getType(DBTCPConnector.java:571)
    at com.mongodb.DBTCPConnector.getReplicaSetStatus(DBTCPConnector.java:362)
    at com.mongodb.Mongo.getReplicaSetStatus(Mongo.java:446)
    ...

After reading for a while, a few responses here on SO and other forums state that the java.lang.IllegalArgumentException: response too long: 1347703880 error might be caused by a faulty Mongo driver. Based on that, I started executing spark-shell with updated drivers, like so:

spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2 --jars casbah-commons_2.10-3.1.1.jar,casbah-core_2.10-3.1.1.jar,casbah-query_2.10-3.1.1.jar,mongo-java-driver-2.13.0.jar
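
For reference, an equivalent invocation would let spark-shell resolve the same Casbah modules from Maven Central instead of relying on hand-downloaded JARs; the coordinates below are what I assume to be the matching artifacts:

spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2,org.mongodb:casbah-commons_2.10:3.1.1,org.mongodb:casbah-core_2.10:3.1.1,org.mongodb:casbah-query_2.10:3.1.1,org.mongodb:mongo-java-driver:2.13.0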

Of course, before this I downloaded the JARs and stored them in the same path from which spark-shell was executed. Nonetheless, with this approach spark-shell answers with the following cryptic error message:

Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: com/mongodb/casbah/query/dsl/CurrentDateOp
    at com.mongodb.casbah.MongoClient.apply(MongoClient.scala:218)
    at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner.isShardedCollection(MongodbPartitioner.scala:78)

It is worth mentioning that the target MongoDB is a Meteor Mongo database, which is why I'm trying to connect with [IP.OF.REMOTE.HOST]:3001 instead of using the default port 27017.

What might be the issue? I've followed many tutorials, but all of them seem to have MongoDB on the same host, allowing them to declare localhost:27017 in the credentials. Is there something I'm missing?

Thanks for any help!

Answer

I ended up using MongoDB's official Java driver instead. This was my first experience with Spark and the Scala programming language, so I wasn't very familiar with the idea of using plain Java JARs yet.

I downloaded the necessary JARs and stored them in the same directory as the job file, which is a Scala file. So the directory looked something like:

/job_directory
|--job.scala
|--bson-3.0.1.jar
|--mongodb-driver-3.0.1.jar
|--mongodb-driver-core-3.0.1.jar

Then, I start spark-shell as follows to load the JARs and their classes into the shell environment:

spark-shell --jars "mongodb-driver-3.0.1.jar,mongodb-driver-core-3.0.1.jar,bson-3.0.1.jar"

Next, I execute the following to load the job's source code into spark-shell:

:load job.scala

Finally, I execute the main object of my job like so:

MainObject.main(Array())

As for the code inside MainObject, it is simply what the tutorial states:

import com.mongodb.MongoClient

// The legacy MongoClient/getDB API is still available in driver 3.0.1.
val mongo = new MongoClient(IP_OF_REMOTE_MONGO, 27017)
val db = mongo.getDB(DB_NAME)
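
For other beginners, here is a minimal sketch of what job.scala could look like as a whole; the object name, host, and collection names are placeholders of mine, not something prescribed by the driver:

import com.mongodb.MongoClient

object MainObject {
  def main(args: Array[String]): Unit = {
    // Placeholders: substitute the real host, database, and collection names.
    val mongo = new MongoClient("IP.OF.REMOTE.HOST", 27017)
    val db = mongo.getDB("meteor")
    val collection = db.getCollection("my_target_collection")

    // Print a few documents to verify the connection actually works.
    val cursor = collection.find().limit(5)
    while (cursor.hasNext) println(cursor.next())
    cursor.close()
    mongo.close()
  }
}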

Hopefully this will help future readers and spark-shell/Scala beginners!
