Modify Jupyter kernel to add Cassandra connection in Spark
Problem description
I have a Jupyter kernel working with PySpark.
> cat kernel.json
{
  "argv": ["python", "-m", "sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
  "display_name": "PySpark"
}
I want to modify this kernel to add a connection to Cassandra. In script mode, I type:
pyspark \
--packages anguenot:pyspark-cassandra:0.7.0 \
--conf spark.cassandra.connection.host=12.34.56.78 \
--conf spark.cassandra.auth.username=cassandra \
--conf spark.cassandra.auth.password=cassandra
The script version works perfectly. But I want to do the same thing in Jupyter.
Where should I put this information in my kernel? I already tried both:
{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
"display_name":"PySpark with Cassandra",
"spark.jars.packages": "anguenot:pyspark-cassandra:0.7.0",
"spark.cassandra.connection.host": "12.34.56.78",
"spark.cassandra.auth.username": "cassandra",
"spark.cassandra.auth.password": "cassandra"
}
and
{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
"display_name":"PySpark with Cassandra",
"PYSPARK_SUBMIT_ARGS": "--packages anguenot:pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78 --conf spark.cassandra.auth.username=cassandra --conf spark.cassandra.auth.password=cassandra"
}
Neither of them works. When I execute:
sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="my_table", keyspace="my_keyspace")\
.load()
I get the error java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.
FYI: I am not creating the Spark session from within the notebook. The sc object already exists when the kernel starts.
Recommended answer
The spark.jars.* options have to be configured before the SparkContext is initialized. Once it has been initialized, configuration changes have no effect. This means you have to do one of the following:
- Modify SPARK_HOME/conf/spark-defaults.conf or SPARK_CONF_DIR/spark-defaults.conf, and make sure that SPARK_HOME or SPARK_CONF_DIR is in scope when the kernel is started (see the first sketch below).
- Modify the kernel initialization code (where the SparkContext is initialized), using the same methods as described in Add Jar to standalone pyspark (see the second sketch below).
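For the first option, a minimal sketch of the spark-defaults.conf entries, reusing the package and connection settings from the question (the host and credentials are placeholders):

spark.jars.packages              anguenot:pyspark-cassandra:0.7.0
spark.cassandra.connection.host  12.34.56.78
spark.cassandra.auth.username    cassandra
spark.cassandra.auth.password    cassandra

For the second option, a sketch of what the kernel initialization code could set before the SparkContext is created, assuming the kernel honors PYSPARK_SUBMIT_ARGS the way standalone PySpark does (note the trailing pyspark-shell token, which PySpark expects at the end of that variable):

import os

# Must run before the SparkContext is created; it has no effect afterwards.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages anguenot:pyspark-cassandra:0.7.0 "
    "--conf spark.cassandra.connection.host=12.34.56.78 "
    "--conf spark.cassandra.auth.username=cassandra "
    "--conf spark.cassandra.auth.password=cassandra "
    "pyspark-shell"
)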
I would also strongly recommend Configuring Spark to work with Jupyter Notebook and Anaconda.