sparkR with Cassandra

Question
I want to read a data frame that comes from a Cassandra keyspace and column_family. When running sparkR, I call the respective spark-cassandra-connector package and set the conf to my local Spark Cassandra host. I do not get any errors when running the below.
$ ./bin/sparkR --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf spark.cassandra.connection.host=127.0.0.1
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
people <- read.df(sqlContext,
                  source = "org.apache.spark.sql.cassandra",
                  keyspace = "keyspace_name",
                  table = "table_name")
I get the following error:
Error in writeJobj(con, object) : invalid jobj 1
Do I have to pass the conf into the sparkContext assignment (sc), and if so, how in sparkR? Below are my Spark and Cassandra versions:

Spark: 1.5.1
Cassandra: 2.1.6
Cassandra Connector: updated to 1.5.0-M2 per zero323's advice
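Regarding the conf question: in SparkR 1.5, `sparkR.init` accepts a `sparkEnvir` list of Spark properties, so the Cassandra host can also be set inside the script rather than only via `--conf` on the command line. A minimal sketch, reusing the example host and table names from the question:

```r
# Start the SparkR context, passing Spark properties via sparkEnvir
# instead of (or in addition to) --conf on the command line.
sc <- sparkR.init(
  master = "local",
  sparkEnvir = list(spark.cassandra.connection.host = "127.0.0.1")
)
sqlContext <- sparkRSQL.init(sc)

# Read the Cassandra table as before.
people <- read.df(sqlContext,
                  source = "org.apache.spark.sql.cassandra",
                  keyspace = "keyspace_name",
                  table = "table_name")
```

The `--packages` flag still has to be supplied on the command line, since the connector JARs must be on the classpath before the JVM starts.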
Here is a gist with my stack trace.
https://gist.github.com/bhajer3/419561edcb0dc5db2f71
Edit:

I am able to create data frames from tables that do not include any Cassandra collection data types, such as Map, Set, and List. However, many of the schemas I need data from do include these collection types. So it appears that sparkR does not support Cassandra collection data types when reading a data frame from a Cassandra keyspace and column_family. See here for my detailed report/testing procedure.
https://gist.github.com/bhajer3/c3effa92de8e3cfc4fee
Solution

The initial problem:

Generally speaking, you have to match the Spark, spark-cassandra-connector, and Cassandra versions. The connector version should match the major Spark version (connector 1.5 for Spark 1.5, connector 1.4 for Spark 1.4, and so on). Compatibility with the Cassandra version is a little trickier, but you can find a full list of compatible versions in the connector README.md.
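For example, the pairing shows up in the `--packages` coordinate (the 1.4.x artifact version below is illustrative; check the connector README for the authoritative compatibility table):

```shell
# Spark 1.5.x pairs with a 1.5.x connector (here the 1.5.0-M2 preview):
./bin/sparkR --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 \
  --conf spark.cassandra.connection.host=127.0.0.1

# Spark 1.4.x would instead pair with a 1.4.x connector, e.g.:
# ./bin/sparkR --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0
```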
Edit:

SparkR < 1.6 doesn't support collecting complex data types, including arrays and maps. This has been solved by SPARK-10049; if you build Spark from master, it works as expected. There is no cassandra-connector release for 1.6, but 1.5-M2 seems to work just fine, at least with the DataFrame API.

Note:
It looks like connector 1.5-M2 incorrectly reports Date keys as Timestamps, so please beware if you use these in your database.