Spark with Cassandra Python setup

Problem Description

I am trying to use Spark to do some simple computations on Cassandra tables, but I am quite lost.

I am trying to follow: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md

So I'm running the PySpark shell with:

./bin/pyspark \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3

But I am not sure how to set things up from here. How do I let Spark know where my Cassandra cluster is? I've seen that CassandraSQLContext can be used for this, but I've also read that it is deprecated.

I have read: How to connect Spark with Cassandra using spark-cassandra-connector?

But if I use

import com.datastax.spark.connector._

Python says that it can't find the module. Can someone point me in the right direction on how to set things up properly?

Recommended Answer

The Cassandra connector doesn't provide any Python modules. All functionality is exposed through the Data Source API, and as long as the required jars are present, everything should work out of the box.

How do I let Spark know where my Cassandra cluster is?

Use the spark.cassandra.connection.host property. You can, for example, pass it as an argument to spark-submit / pyspark:

pyspark ... --conf spark.cassandra.connection.host=x.y.z.v

or set it in your configuration:

(SparkSession.builder
    .config("spark.cassandra.connection.host", "x.y.z.v"))
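
Putting both pieces together, here is a minimal sketch of a self-contained session setup. The connector version is taken from the question and the host is a placeholder; match the artifact to your Spark and Scala versions:

from pyspark.sql import SparkSession

# A minimal sketch: load the connector jar via spark.jars.packages and
# point it at the Cassandra contact host. Version and host are taken
# from the snippets above -- adjust both to your own setup.
spark = (SparkSession.builder
    .appName("cassandra-example")
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3")
    .config("spark.cassandra.connection.host", "x.y.z.v")
    .getOrCreate())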

Configuration like table name or keyspace can be set directly on the reader:

(spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="kv", keyspace="test", cluster="cluster")
    .load())
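
Once loaded, ordinary DataFrame operations apply. As a sketch, here is a simple aggregation followed by a write back through the same data source; the kv table and test keyspace come from the snippet above, while the key column and the kv_out target table are assumptions about your schema:

df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="kv", keyspace="test")
    .load())

# A simple computation -- assumes the table has a "key" column.
df.groupBy("key").count().show()

# Writing uses the same data source. "kv_out" is a hypothetical target
# table used for illustration; it must already exist in Cassandra.
(df.write
    .format("org.apache.spark.sql.cassandra")
    .options(table="kv_out", keyspace="test")
    .mode("append")
    .save())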

For the rest, you can follow the DataFrames documentation.

P.S.

import com.datastax.spark.connector._

is Scala syntax; Python accepts it only by accident, because it happens to parse as a dotted module import. No such Python module exists, which is why the import fails.
