sparkR with Cassandra


Problem Description



I want to read a dataframe that comes from a Cassandra keyspace and column_family. When running sparkR, I call the spark-cassandra-connector package and set the conf to my local Spark Cassandra host. I do not get any errors when running the command below.

$ ./bin/sparkR --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf spark.cassandra.connection.host=127.0.0.1

sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
people <- read.df(sqlContext,
    source = "org.apache.spark.sql.cassandra",
    keyspace = "keyspace_name", table = "table_name")

I get the following error:

Error in writeJobj(con, object) : invalid jobj 1

Do I have to pass the conf into the sparkContext assignment (sc), and if so, how do I do that in sparkR?

Below are my Spark and Cassandra versions:

Spark: 1.5.1
Cassandra: 2.1.6
Cassandra Connector: updated to use 1.5.0-M2 per zero323's advice

Here is a gist of my stack trace:

https://gist.github.com/bhajer3/419561edcb0dc5db2f71

Edit:

I am able to create data frames from tables that do not include any Cassandra collection data types, such as Map, Set, and List. But many of the schemas that I need data from do include these collection data types.

Thus, sparkR does not support Cassandra collection data types when reading a dataframe that comes from a Cassandra keyspace and column_family. See here for my detailed report/testing procedures:

https://gist.github.com/bhajer3/c3effa92de8e3cfc4fee

Solution

The initial problem:

Generally speaking, you have to match the Spark, spark-cassandra-connector, and Cassandra versions. The connector version should match the major Spark version (connector 1.5 for Spark 1.5, connector 1.4 for Spark 1.4, and so on).

Compatibility with the Cassandra version is a little bit trickier, but you can find a full list of compatible versions in the connector's README.md.
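As for the other part of the question (whether the conf has to be passed into the sparkContext assignment): the same settings can also be provided programmatically through sparkR.init instead of command-line flags. A minimal sketch, assuming the Spark 1.5 sparkR.init signature (sparkEnvir for Spark properties, sparkPackages for the connector) and the same placeholder keyspace/table names as in the question:

library(SparkR)

# Spark properties go into sparkEnvir; the connector is pulled in via sparkPackages,
# keeping its major version in line with the Spark major version (1.5 here).
sc <- sparkR.init(
    master = "local",
    sparkEnvir = list(spark.cassandra.connection.host = "127.0.0.1"),
    sparkPackages = "com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2"
)
sqlContext <- sparkRSQL.init(sc)

people <- read.df(sqlContext,
    source = "org.apache.spark.sql.cassandra",
    keyspace = "keyspace_name", table = "table_name")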

Edit:

SparkR < 1.6 doesn't support collecting complex data types, including arrays or maps. This has been solved by SPARK-10049. If you build Spark from master, it works as expected. There is no cassandra-connector release for Spark 1.6 yet, but 1.5-M2 seems to work just fine, at least with the DataFrame API.
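For illustration, on a build that includes the SPARK-10049 fix, reading and collecting a table that contains collection columns should no longer fail. A rough sketch, reusing the sqlContext from above and using hypothetical table/column names:

# Hypothetical table containing Map/Set/List columns (names are placeholders).
df <- read.df(sqlContext,
    source = "org.apache.spark.sql.cassandra",
    keyspace = "keyspace_name", table = "table_with_collections")

printSchema(df)          # collection columns appear as array / map types in the schema
local_df <- collect(df)  # this is the step that fails in SparkR < 1.6
str(local_df)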

Note:

It looks like connector 1.5-M2 incorrectly reports Date keys as Timestamps, so please beware if you use these in your database.
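If that affects you, one possible workaround is to cast the affected column back on the SparkR side. An untested sketch, with placeholder table and column names:

# Add a column that casts the mis-reported timestamp back to a date (names are placeholders).
events <- read.df(sqlContext,
    source = "org.apache.spark.sql.cassandra",
    keyspace = "keyspace_name", table = "events_by_date")

events <- withColumn(events, "event_date", cast(events$event_date_ts, "date"))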
