Modify Jupyter kernel to add Cassandra connection in Spark


Problem description

I have a Jupyter kernel working with PySpark.

> cat kernel.json
{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
 "display_name":"PySpark"
}

I want to modify this kernel to add a connection to Cassandra. In script mode, I type:

pyspark \
    --packages anguenot:pyspark-cassandra:0.7.0 \
    --conf spark.cassandra.connection.host=12.34.56.78 \
    --conf spark.cassandra.auth.username=cassandra \
    --conf spark.cassandra.auth.password=cassandra

The script version works perfectly, but I would like to do the same thing in Jupyter.

Where should I put this information in my kernel? I have already tried both of the following:

{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
 "display_name":"PySpark with Cassandra",
 "spark.jars.packages": "anguenot:pyspark-cassandra:0.7.0",
 "spark.cassandra.connection.host": "12.34.56.78",
 "spark.cassandra.auth.username": "cassandra",
 "spark.cassandra.auth.password": "cassandra"
}

{"argv":["python","-m","sparkmagic.kernels.pysparkkernel.pysparkkernel", "-f", "{connection_file}"],
 "display_name":"PySpark with Cassandra",
 "PYSPARK_SUBMIT_ARGS": "--packages anguenot:pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78 --conf spark.cassandra.auth.username=cassandra --conf spark.cassandra.auth.password=cassandra"
}

Neither of them works. When I execute:

sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="my_table", keyspace="my_keyspace")\
    .load()

I get the error java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.

FYI: I am not creating the Spark session from within the notebook. The sc object already exists when the kernel starts.

Recommended answer

spark.jars.* options have to be configured before the SparkContext is initialized; once it has been created, they have no effect. This means you have to do one of the following:

  • Modify SPARK_HOME/conf/spark-defaults.conf or SPARK_CONF_DIR/spark-defaults.conf, and make sure that SPARK_HOME or SPARK_CONF_DIR is in scope when the kernel is started (a sketch follows this list).
  • Modify the kernel initialization code (where the SparkContext is created) using the same methods as described in Add Jar to standalone pyspark (a sketch closes this answer).
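
For the first option, a minimal sketch of a spark-defaults.conf, reusing the package coordinates and the placeholder host/credentials from the question:

# SPARK_HOME/conf/spark-defaults.conf -- read once, when the SparkContext is created
spark.jars.packages              anguenot:pyspark-cassandra:0.7.0
spark.cassandra.connection.host  12.34.56.78
spark.cassandra.auth.username    cassandra
spark.cassandra.auth.password    cassandra

For the kernel process to pick this up, it must see the corresponding SPARK_HOME or SPARK_CONF_DIR environment variable; kernel specs accept an env dictionary in kernel.json for exactly this purpose.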

I would also strongly recommend configuring Spark to work with Jupyter Notebook and Anaconda.
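
To make the second option concrete, here is a sketch of initialization code that applies the same settings before any SparkContext exists, assuming the kernel's startup code is where the context gets created (host and credentials are again the placeholder values from the question):

# A sketch of the second option: configure everything in the kernel's
# initialization code, before the SparkContext is created.
from pyspark import SparkConf
from pyspark.sql import SparkSession, SQLContext

conf = (SparkConf()
        .set("spark.jars.packages", "anguenot:pyspark-cassandra:0.7.0")
        .set("spark.cassandra.connection.host", "12.34.56.78")
        .set("spark.cassandra.auth.username", "cassandra")
        .set("spark.cassandra.auth.password", "cassandra"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext       # the pre-existing sc the notebook sees
sqlContext = SQLContext(sc)   # so the read call from the question works as-is

Once the package is actually on the classpath, the sqlContext.read shown in the question should resolve org.apache.spark.sql.cassandra without further changes.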
