Issue with a full table scan in Cassandra
Problem description
First: I know a full scan in Cassandra isn't a good idea, but at the moment it's what I need.
When I started looking into how to do this, I read people saying that a full scan isn't possible in Cassandra and that it wasn't made for this kind of thing.
Not satisfied, I kept looking until I found this article: http://www.myhowto.org/bigdata/2013/11/04/scanning-the-entire-cassandra-column-family-with-cql/
It looked pretty reasonable, so I gave it a try. Since I will run this full scan only once, and time and performance aren't an issue, I wrote the query and put it in a simple job that looks up all the records I want. Out of 2 billion rows, I expected about 1000 matches, but I got only 100.
My job:
public void run() {
    Cluster cluster = getConnection();
    Session session = cluster.connect("db");
    LOGGER.info("Starting ...");
    boolean run = true;
    int print = 0;
    while (run) {
        if (maxTokenReached(actualToken)) {
            LOGGER.info("Max Token Reached!");
            break;
        }
        ResultSet resultSet = session.execute(queryBuilder(actualToken));
        Iterator<Row> rows = resultSet.iterator();
        if (!rows.hasNext()) {
            break;
        }
        List<String> rowIds = new ArrayList<String>();
        while (rows.hasNext()) {
            Row row = rows.next();
            Long leadTime = row.getLong("my_column");
            if (myCondition(leadTime)) {
                String rowId = row.getString("key");
                rowIds.add(rowId);
            }
            if (!rows.hasNext()) {
                Long token = row.getLong("token(key)");
                if (!rowIds.isEmpty()) {
                    LOGGER.info(String.format("Keys found! RowId's: %s ", rowIds));
                }
                actualToken = nextToken(token);
            }
        }
    }
    LOGGER.info("Done!");
    cluster.shutdown();
}

public boolean maxTokenReached(Long actualToken) {
    return actualToken >= maxToken;
}

public String queryBuilder(Long nextRange) {
    return String.format("select token(key), key, my_column from mytable where token(key) >= %s limit 10000;", nextRange.toString());
}

public Long nextToken(Long token) {
    return token + 1;
}
Basically, I start at the minimum allowed token and move forward incrementally until the last one.
It looks as if the job didn't complete the full scan, or as if my query only reached one node. I don't know whether I'm doing something wrong or whether a full scan really isn't possible.
Today I have almost 2 TB of data in a single table, on one cluster of seven nodes.
Has anyone been in this situation, or does anyone have a recommendation?
Recommended answer
It's definitely possible to do a full table scan in Cassandra. Indeed, it's quite common for things like Spark. However, it's not typically "fast", so it's discouraged unless you know why you're doing it. For your actual questions:
1) If you're using CQL, you're almost certainly using the Murmur3 partitioner, so your minimum token is -9223372036854775808 (and your maximum token is 9223372036854775807).
2) You're using session.execute(), which uses a default consistency level of ONE. That may not return all of the results in your cluster, especially if you're also writing at ONE, which I suspect you may be. Raise that to ALL, and use prepared statements to speed up the CQL parsing:
public void run() {
    Cluster cluster = getConnection();
    Session session = cluster.connect("db");
    LOGGER.info("Starting ...");
    // Start at the minimum Murmur3 token (-9223372036854775808).
    actualToken = Long.MIN_VALUE;
    boolean run = true;
    int print = 0;
    while (run) {
        if (maxTokenReached(actualToken)) {
            LOGGER.info("Max Token Reached!");
            break;
        }
        SimpleStatement stmt = new SimpleStatement(queryBuilder(actualToken));
        // Read at ALL instead of the default ONE.
        stmt.setConsistencyLevel(ConsistencyLevel.ALL);
        ResultSet resultSet = session.execute(stmt);
        Iterator<Row> rows = resultSet.iterator();
        if (!rows.hasNext()) {
            break;
        }
        List<String> rowIds = new ArrayList<String>();
        while (rows.hasNext()) {
            Row row = rows.next();
            Long leadTime = row.getLong("my_column");
            if (myCondition(leadTime)) {
                String rowId = row.getString("key");
                rowIds.add(rowId);
            }
            if (!rows.hasNext()) {
                // Last row of this page: token(key) is the first selected column.
                Long token = row.getLong(0);
                if (!rowIds.isEmpty()) {
                    LOGGER.info(String.format("Keys found! RowId's: %s ", rowIds));
                }
                actualToken = nextToken(token);
            }
        }
    }
    LOGGER.info("Done!");
    cluster.shutdown();
}

public boolean maxTokenReached(Long actualToken) {
    return actualToken >= maxToken;
}

public String queryBuilder(Long nextRange) {
    return String.format("select token(key), key, my_column from mytable where token(key) >= %s limit 10000;", nextRange.toString());
}

public Long nextToken(Long token) {
    // Guard against overflow: Long.MAX_VALUE + 1 would wrap around to the minimum token.
    return token == Long.MAX_VALUE ? token : token + 1;
}
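
As a follow-up to point 1, the token bounds can also be used to split the scan into independent sub-ranges instead of walking a single cursor up from the minimum token. The following is a minimal sketch and not part of the original answer: the class TokenRanges and its methods split and query are hypothetical names, and the table and column names are copied from the question. It only builds the ranges and the query strings and does not talk to a cluster.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (an assumption for illustration, not a driver API). */
final class TokenRanges {
    // Murmur3 tokens are signed 64-bit values.
    static final long MIN_TOKEN = Long.MIN_VALUE; // -9223372036854775808
    static final long MAX_TOKEN = Long.MAX_VALUE; //  9223372036854775807

    /** Splits [MIN_TOKEN, MAX_TOKEN] into contiguous inclusive ranges {start, end}. */
    static List<long[]> split(int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        BigInteger min = BigInteger.valueOf(MIN_TOKEN);
        // There are 2^64 tokens in total; BigInteger avoids long overflow.
        BigInteger total = BigInteger.ONE.shiftLeft(64);
        BigInteger n = BigInteger.valueOf(parts);
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            BigInteger start = min.add(total.multiply(BigInteger.valueOf(i)).divide(n));
            BigInteger end = min.add(total.multiply(BigInteger.valueOf(i + 1)).divide(n))
                    .subtract(BigInteger.ONE);
            ranges.add(new long[] {start.longValueExact(), end.longValueExact()});
        }
        return ranges;
    }

    /** Builds a CQL query restricted to one inclusive token range. */
    static String query(long start, long end) {
        return "select token(key), key, my_column from mytable where token(key) >= "
                + start + " and token(key) <= " + end + ";";
    }

    public static void main(String[] args) {
        // Print one query per sub-range; 4 parts is an arbitrary choice.
        for (long[] range : split(4)) {
            System.out.println(query(range[0], range[1]));
        }
    }
}
```

Each range could then be run as its own statement at the consistency level discussed in point 2. Whether splitting helps depends on the cluster, so treat the number of parts as something to tune.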