Querying Large Datasets in Cassandra


Problem Description

I am an RDBMS programmer by experience. I am working on a scientific research problem involving genomic data. I was assigned to explore Cassandra since we needed a Big Data solution that is scalable and cheap (free). Setting Cassandra up and loading it with data was seductively trivial and similar to my experience with traditional DBs like Oracle and MySQL. My problem is finding a simple strategy to query the data, since this is a fundamental requirement for any data repository. The data I am working with consists of mutation datasets that contain positional information as well as calculated numerical measures about the data. I set up an initial static column family that looks like this:

CREATE TABLE variant (
chrom text,
pos int,
ref text,
alt text,
aa text,
ac int,
af float,
afr_af text,
amr_af text,
an int,
asn_af text,
avgpost text,
erate text,
eur_af text,
ldaf text,
mutation_id text,
patient_id int,
rsq text,
snpsource text,
theta text,
vt text,
PRIMARY KEY (chrom, pos, ref, alt)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};

CREATE INDEX af_variant_idx ON variant (af);

As you can see, there is a natural primary key of positional data (chrom, pos, ref and alt). This data is not meaningful from a querying point of view. Much more interesting to my clients right now is extracting rows whose AF value falls below a certain threshold. I am using a Java RESTful service to interact with this database through the CQL JDBC driver. It quickly became apparent that directly querying this table by AF would not work, since the select statement seems to have to identify the row keys you want to look at. I found some confusing discussions on this point, but since there are fewer than 100 distinct AF values, what I decided to do was build a lookup table that looks like this:

CREATE TABLE af_lookup (
  af_id float,
  column1 text,
  column2 text,
  value text,
  PRIMARY KEY (af_id, column1, column2)
 ) WITH COMPACT STORAGE AND
 bloom_filter_fp_chance=0.010000 AND
 caching='KEYS_ONLY' AND
 comment='' AND
 dclocal_read_repair_chance=0.000000 AND
 gc_grace_seconds=864000 AND
 read_repair_chance=0.100000 AND
 replicate_on_write='true' AND
 populate_io_cache_on_flush='false' AND
 compaction={'class': 'SizeTieredCompactionStrategy'} AND
 compression={'sstable_compression': 'SnappyCompressor'};

This was meant to be a dynamic table with very wide rows. I populated it from the data stored in my static column family: the AF value is the row key, and the compound key from the other table, concatenated with '-' (e.g. 1-129-T-G), is stored as a string in the dynamic column name. This works OK, but I still do not understand how all of these pieces fit together. Dynamic column families seem to work as advertised only with CQL 2, yet I really need to use operators like >, <, >= and <=. It seems like this should be theoretically possible, but I have not found a solution in the last four weeks of trying a number of different tools (I tried Astyanax as well as the JDBC driver).
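For what it is worth, here is a minimal sketch (with invented AF values and row keys) of which statements Cassandra will and will not accept against these two tables: the secondary index on af only supports equality predicates, while a range predicate is allowed on a clustering column once the partition key is pinned with equality.

-- Equality on the indexed column is served by the secondary index:
SELECT * FROM variant WHERE af = 0.05;

-- A range predicate on the indexed, non-key column is rejected;
-- newer releases will only run it with ALLOW FILTERING, i.e. a full scan:
-- SELECT * FROM variant WHERE af < 0.05;

-- In af_lookup the partition key must be fixed with '=', after which
-- a range is allowed on the first clustering column:
SELECT column1, column2, value FROM af_lookup
WHERE af_id = 0.05 AND column1 > '1-129-T-G';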

I have two primary problems. The first is the RPC timeout limitation when querying this data, which can return tens of thousands to millions of records. The second is how to present the data in HTML by fetching only the data that has not been shown yet (previous/next links), similar to the way OpsCenter displays column family data. This does not seem possible with the functional limitation of not being able to use >, <, >= and <=. Based on my experience, this is probably a lack of understanding on my part of how the product really works rather than a lack of capability in the product (databases would not be very useful if they could only handle writes well).
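One common way around both problems (a sketch of the general technique, not something stated in the original post; the page size and values are invented) is to page inside a partition with LIMIT and restart each page strictly after the last clustering value the client has already seen, so a single web request never pulls the whole result set:

-- First page for one AF bucket:
SELECT column1, value FROM af_lookup
WHERE af_id = 0.05
LIMIT 20;

-- Next page: the web service passes the last column1 it rendered back in
-- the 'next' link and resumes after it, so earlier rows are not rescanned:
SELECT column1, value FROM af_lookup
WHERE af_id = 0.05 AND column1 > '1-20443-A-C'
LIMIT 20;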

Has anyone out there encountered this issue and solved it before? I would really appreciate an example of how to implement a C* solution with Java web services that displays a large number of results and pages through them.

Solution

You may want to explore and use PlayOrm for Cassandra, as it can resolve your timeout limitation and pagination problems. PlayOrm returns a cursor when you query; as your first page reads in the first 20 results and displays them, the next page can simply reuse the same cursor from your session, and it picks up right where it left off without rescanning the first 20 rows.
Visit http://buffalosw.com/wiki/An-example-to-begin-with-PlayOrm/ for an example of the cursor, and http://buffalosw.com/products/playorm/ for all of PlayOrm's features and more details.
