Getting Spark, Python, and MongoDB to work together
Question
I'm having difficulty getting these components to knit together properly. I have Spark installed and working successfully; I can run jobs locally, standalone, and also via YARN. I have followed the steps advised (to the best of my knowledge) here and here (https://github.com/mongodb/mongo-hadoop/blob/master/spark/src/main/python/README.rst).
I'm working on Ubuntu, and the various component versions I have are:
- Spark spark-1.5.1-bin-hadoop2.6
- Hadoop hadoop-2.6.1
- Mongo 2.6.10
- Mongo-Hadoop connector cloned from https://github.com/mongodb/mongo-hadoop.git
- Python 2.7.10
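(For reference, a quick way to confirm what is actually installed, assuming the binaries are on the PATH:)

spark-submit --version        # should report 1.5.1
hadoop version                # should report 2.6.1
mongod --version              # should report 2.6.10
python --version              # should report 2.7.10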
I had some difficulty following the various steps, such as which jars to add to which path, so here is what I have added:
- In /usr/local/share/hadoop-2.6.1/share/hadoop/mapreduce I have added mongo-hadoop-core-1.5.0-SNAPSHOT.jar
- The following environment variables:
export HADOOP_HOME="/usr/local/share/hadoop-2.6.1"
export PATH=$PATH:$HADOOP_HOME/bin
export SPARK_HOME="/usr/local/share/spark-1.5.1-bin-hadoop2.6"
export PYTHONPATH="/usr/local/share/mongo-hadoop/spark/src/main/python"
export PATH=$PATH:$SPARK_HOME/bin
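(A quick way to confirm that the PYTHONPATH entry is actually picked up — a minimal check, run in the same shell as the exports above:)

python -c "import pymongo_spark; print(pymongo_spark.__file__)"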
My Python program is basic:
from pyspark import SparkContext, SparkConf

import pymongo_spark
pymongo_spark.activate()


def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    rdd = sc.mongoRDD(
        'mongodb://username:password@localhost:27017/mydb.mycollection')


if __name__ == '__main__':
    main()
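(One note on the program itself: mongoRDD only defines the RDD, and Spark evaluates lazily, so as written nothing would actually be read from MongoDB even with a correct classpath. A sketch of the same program with an action added to force a real round trip:)

from pyspark import SparkContext, SparkConf

import pymongo_spark
pymongo_spark.activate()


def main():
    conf = SparkConf().setAppName("pyspark test")
    sc = SparkContext(conf=conf)
    rdd = sc.mongoRDD(
        'mongodb://username:password@localhost:27017/mydb.mycollection')
    # first() is an action; without one, Spark never actually
    # reads from MongoDB because RDDs are evaluated lazily
    print(rdd.first())


if __name__ == '__main__':
    main()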
I am running it using the command:
$SPARK_HOME/bin/spark-submit --driver-class-path /usr/local/share/mongo-hadoop/spark/build/libs/ --master local[4] ~/sparkPythonExample/SparkPythonExample.py
and I get the following output:
Traceback (most recent call last):
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 24, in <module>
    main()
  File "/home/me/sparkPythonExample/SparkPythonExample.py", line 17, in main
    rdd = sc.mongoRDD('mongodb://username:password@localhost:27017/mydb.mycollection')
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 161, in mongoRDD
    return self.mongoPairRDD(connection_string, config).values()
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 143, in mongoPairRDD
    _ensure_pickles(self)
  File "/usr/local/share/mongo-hadoop/spark/src/main/python/pymongo_spark.py", line 80, in _ensure_pickles
    orig_tb)
py4j.protocol.Py4JError
According to here:
"This exception is raised when an exception occurs in the Java client code. For example, if you try to pop an element from an empty stack. The instance of the Java exception thrown is stored in the java_exception member."
Looking at the source code for pymongo_spark.py, the line throwing the error says:

"Error while communicating with the JVM. Is the MongoDB Spark jar on Spark's CLASSPATH? : "
So in response I have tried to be sure the right jars are being passed, but I might be doing this all wrong; see below:
$SPARK_HOME/bin/spark-submit \
  --jars /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar \
  --driver-class-path /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar \
  --master local[4] ~/sparkPythonExample/SparkPythonExample.py
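(One detail that may matter in that command: --jars takes a comma-separated list, but --driver-class-path is an ordinary JVM classpath, which is colon-separated on Linux. A variant worth trying, under that assumption:)

$SPARK_HOME/bin/spark-submit \
  --jars /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar,/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar \
  --driver-class-path /usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-java-driver-3.0.4.jar:/usr/local/share/spark-1.5.1-bin-hadoop2.6/lib/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar \
  --master local[4] ~/sparkPythonExample/SparkPythonExample.py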
I have imported pymongo into the same Python program to verify that I can at least access MongoDB using that, and I can.
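(For illustration, a minimal check of that kind might look like the following; mydb and mycollection are placeholders:)

import pymongo

# plain PyMongo, bypassing Spark entirely; if this works,
# MongoDB itself and the credentials are fine
client = pymongo.MongoClient('mongodb://username:password@localhost:27017/')
print(client.mydb.mycollection.count())
client.close()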
I know there are quite a few moving parts here, so if I can provide any more useful information please let me know.
Answer
Update (2016-03-30)
Since the original answer I found two different ways to connect to MongoDB from Spark:
- mongodb/mongo-spark
- Stratio/Spark-MongoDB
While the former seems to be relatively immature, the latter looks like a much better choice than the Mongo-Hadoop connector and provides a Spark SQL API.
# Adjust Scala and package version according to your setup
# although officially 0.11 supports only Spark 1.5
# I haven't encountered any issues on 1.6.1
bin/pyspark --packages com.stratio.datasource:spark-mongodb_2.11:0.11.0
df = (sqlContext.read
    .format("com.stratio.datasource.mongodb")
    .options(host="mongo:27017", database="foo", collection="bar")
    .load())

df.show()

## +---+----+--------------------+
## |  x|   y|                 _id|
## +---+----+--------------------+
## |1.0|-1.0|56fbe6f6e4120712c...|
## |0.0| 4.0|56fbe701e4120712c...|
## +---+----+--------------------+
It seems to be much more stable than mongo-hadoop-spark, supports predicate pushdown without static configuration, and simply works.
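(For instance, continuing the shell session above, a filter expressed through the DataFrame API can be evaluated by the data source instead of being applied after loading — a sketch using the df defined above:)

# the predicate on x can be pushed down to MongoDB rather than
# filtering rows after they reach Spark
df.filter(df.x > 0).show()

## +---+----+--------------------+
## |  x|   y|                 _id|
## +---+----+--------------------+
## |1.0|-1.0|56fbe6f6e4120712c...|
## +---+----+--------------------+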
Original answer

Indeed, there are quite a few moving parts here. I tried to make it a little bit more manageable by building a simple Docker image which roughly matches the described configuration (I've omitted the Hadoop libraries for brevity, though). You can find the complete source on GitHub (DOI 10.5281/zenodo.47882) and build it from scratch:

git clone https://github.com/zero323/docker-mongo-spark.git
cd docker-mongo-spark
docker build -t zero323/mongo-spark .
or download an image I've pushed to Docker Hub so you can simply docker pull zero323/mongo-spark.

Start the images:
docker run -d --name mongo mongo:2.6
docker run -i -t --link mongo:mongo zero323/mongo-spark /bin/bash
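(To confirm the mongo container is actually up before going further, a quick ping from the host works, assuming the mongo shell shipped in the mongo:2.6 image:)

docker exec mongo mongo --eval 'db.runCommand({ping: 1})'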
Start a PySpark shell passing --jars and --driver-class-path:

pyspark --jars ${JARS} --driver-class-path ${SPARK_DRIVER_EXTRA_CLASSPATH}
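(Both variables are set inside the image; for illustration only, their shape is roughly the following — the concrete paths here are hypothetical:)

# hypothetical paths, shown only to illustrate the separators:
# --jars wants a comma-separated list, the driver classpath a colon-separated one
JARS=/jars/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar,/jars/mongo-java-driver-3.0.4.jar
SPARK_DRIVER_EXTRA_CLASSPATH=/jars/mongo-hadoop-spark-1.5.0-SNAPSHOT.jar:/jars/mongo-java-driver-3.0.4.jar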
And finally see how it works:
import pymongo
import pymongo_spark

mongo_url = 'mongodb://mongo:27017/'

client = pymongo.MongoClient(mongo_url)
client.foo.bar.insert_many([
    {"x": 1.0, "y": -1.0}, {"x": 0.0, "y": 4.0}])
client.close()

pymongo_spark.activate()
rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
    .map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.collect()

## [(1.0, -1.0), (0.0, 4.0)]
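(Writing back is symmetrical: pymongo_spark.activate() also patches RDDs with a saveToMongoDB method. A sketch — foo.results is an arbitrary output collection, and given the connection caveat below this is best run as a fresh job:)

# each element must be a dict-like document;
# saveToMongoDB comes from pymongo_spark.activate()
(rdd.map(lambda xy: {"x": xy[0], "y": xy[1]})
    .saveToMongoDB('{0}foo.results'.format(mongo_url)))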
Please note that mongo-hadoop seems to close the connection after the first action, so calling for example rdd.count() after the collect will throw an exception.
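(A possible workaround for that behaviour — a sketch, not guaranteed for every case — is to cache the RDD so that actions after the first are served from memory rather than reopening the Mongo connection:)

rdd = (sc.mongoRDD('{0}foo.bar'.format(mongo_url))
    .map(lambda doc: (doc.get('x'), doc.get('y'))))
rdd.cache()        # materialised on the first action, then reused
rdd.collect()
rdd.count()        # answered from the cache; no second read from Mongo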
Passing mongo-hadoop-spark-1.5.0-SNAPSHOT.jar to both --jars and --driver-class-path is the only hard requirement.
Notes

- This image is loosely based on jaceklaskowski/docker-spark, so please be sure to send some good karma to @jacek-laskowski if it helps.
- If you don't require a development version including the new API (https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage#python-example-unreleasedin-master-branch), then using --packages is most likely a better option.