Spark Cassandra connector - Range query on partition key


Question

I'm evaluating spark-cassandra-connector and I'm struggling to get a range query on the partition key to work.

According to the connector's documentation, it seems possible to do server-side filtering on the partition key using equality or the IN operator. Unfortunately, my partition key is a timestamp, so I cannot use it.

So I tried using Spark SQL with the following query ('timestamp' is the partition key):

select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'

Although the job spawns 200 tasks, the query does not return any data.

I can also confirm that there is data to be returned, since running the query in cqlsh (with the appropriate conversion using the 'token' function) DOES return data.

I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector version is the 'b1.1' branch. The Cassandra driver is the DataStax 'master' branch. The Cassandra cluster is overlaid on the Spark cluster, with 3 servers and a replication factor of 1.

Here is the complete log of the job.

Does anyone have any clues?

Update: When trying to do server-side filtering on the partition key (using the CassandraRDD.where method), I get the following exception:

Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.

But unfortunately I don't know what "filter" is...

Recommended answer

You have several options to get the solution you are looking for.

The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take full advantage of the Lucene indexes in Cassandra through it. I would recommend Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a large part of your data set, you should use the third option below.

Another approach, depending on how your application works, might be to truncate your timestamp field so that you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that. You would have to use the Cassandra driver directly, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released. Your query would look something like this:

select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')

However, as I said before, I don't know whether this fits your needs, since you might not be able to truncate your data and group it by date/time.
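As a rough illustration of the truncated-timestamp approach, the list of day keys for the IN clause can be generated rather than typed by hand. This is a minimal sketch in plain Python; the helper names `day_buckets` and `build_in_query` are hypothetical, and it assumes the partition key has already been truncated to one value per day, stored in the same `'YYYY-MM-DD'` form as in the query above.

```python
from datetime import date, timedelta


def day_buckets(start, end):
    """Return ISO date strings for every day from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def build_in_query(table, column, keys):
    """Build a select statement with an IN clause over the given keys.

    String formatting is used only to keep the sketch short; real code
    should bind the values as parameters through the driver instead.
    """
    quoted = ", ".join("'{}'".format(k) for k in keys)
    return "select * from {} where {} IN ({})".format(table, column, quoted)


# Four days are enough to show the shape of the generated statement.
buckets = day_buckets(date(2013, 1, 1), date(2013, 1, 4))
query = build_in_query("datastore.data", "timestamp", buckets)
print(query)
```

Covering all of 2013 this way yields 365 keys, so whether this is practical depends on how coarse the truncation can be for your data.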

The last option you have, though the least efficient, is to bring the full data set into your Spark cluster and apply a filter on the RDD.
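The predicate for this last option can be sketched in plain Python. The rows below are hypothetical dictionaries standing in for rows already loaded from Cassandra, and the function name `in_range` is an assumption for illustration. On a Spark RDD, a function of this kind would be passed to the RDD's `filter` transformation, so the selection happens in Spark after the data has been read rather than on the Cassandra server.

```python
from datetime import datetime, timezone

# Bounds taken from the query in the question: [2013-01-01, 2013-12-31).
START = datetime(2013, 1, 1, tzinfo=timezone.utc)
END = datetime(2013, 12, 31, tzinfo=timezone.utc)


def in_range(row):
    """True when the row's timestamp lies in the half-open range [START, END)."""
    return START <= row["timestamp"] < END


# Hypothetical sample rows: one before, one inside, one on the upper bound.
rows = [
    {"timestamp": datetime(2012, 12, 31, 23, 59, tzinfo=timezone.utc), "value": 1},
    {"timestamp": datetime(2013, 6, 15, tzinfo=timezone.utc), "value": 2},
    {"timestamp": datetime(2013, 12, 31, tzinfo=timezone.utc), "value": 3},
]

# The built-in filter stands in for the RDD transformation of the same name.
selected = list(filter(in_range, rows))
print([r["value"] for r in selected])
```

Only the middle row is kept: the first is earlier than the lower bound, and the third sits exactly on the excluded upper bound.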

Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.

I hope it helps!
