Issue in full table scan in Cassandra

Question

First: I know it isn't a good idea to do a full scan in Cassandra; however, at the moment, that is what I need.

When I started looking into how to do something like this, I read people saying it wasn't possible to do a full scan in Cassandra and that it wasn't made for this type of thing.

Not satisfied, I kept looking until I found this article: http://www.myhowto.org/bigdata/2013/11/04/scanning-the-entire-cassandra-column-family-with-cql/

It looked pretty reasonable, so I gave it a try. Since I will do this full scan only once, and time and performance aren't an issue, I wrote the query and put it in a simple job to look up all the records I want. Out of 2 billion rows, I expected about 1000 matches; however, I got only 100 records.

My job:

public void run() {
    Cluster cluster = getConnection();
    Session session = cluster.connect("db");

    LOGGER.info("Starting ...");

    boolean run = true;
    int print = 0;

    while ( run ) {
        if (maxTokenReached(actualToken)) {
            LOGGER.info("Max Token Reached!");
            break;
        }
        ResultSet resultSet = session.execute(queryBuilder(actualToken));

        Iterator<Row> rows = resultSet.iterator();
        if ( !rows.hasNext()){
            break;
        }

        List<String> rowIds = new ArrayList<String>();

        while (rows.hasNext()) {
            Row row = rows.next();

            Long leadTime = row.getLong("my_column");
            if (myCondition(leadTime)) {
                String rowId = row.getString("key");
                rowIds.add(rowId);
            }

            if (!rows.hasNext()) {
                Long token = row.getLong("token(key)");
                if (!rowIds.isEmpty()) {
                    LOGGER.info(String.format("Keys found! RowId's: %s ", rowIds));
                }
                actualToken = nextToken(token);
            }

        }

    }
    LOGGER.info("Done!");
    cluster.shutdown();
}

public boolean maxTokenReached(Long actualToken){
    return actualToken >= maxToken;
}

public String queryBuilder(Long nextRange) {
    return String.format("select token(key), key, my_column from mytable where token(key) >= %s limit 10000;", nextRange.toString());
}

public Long nextToken(Long token){
    return token + 1;
}

Basically, what I do is start from the minimum allowed token and move forward incrementally until the last one.
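The paging idea itself can be checked without a cluster. What follows is a minimal sketch, not the job above: a `TreeMap` keyed by token stands in for the table, and the names `TokenPagingSketch`, `fetchPage`, and `scanAll` are hypothetical. It assumes one row per token, which a real table does not guarantee. Each page resumes just after the last token seen, and the loop stops explicitly at the largest token, because adding 1 to `Long.MAX_VALUE` wraps around to `Long.MIN_VALUE`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory sketch of token-based paging; a TreeMap stands in for the Cassandra table. */
class TokenPagingSketch {

    /** Hypothetical stand-in for: SELECT ... WHERE token(key) >= fromToken LIMIT limit */
    static List<Map.Entry<Long, String>> fetchPage(
            NavigableMap<Long, String> table, long fromToken, int limit) {
        List<Map.Entry<Long, String>> page = new ArrayList<>();
        for (Map.Entry<Long, String> e : table.tailMap(fromToken, true).entrySet()) {
            if (page.size() == limit) {
                break;
            }
            page.add(e);
        }
        return page;
    }

    /** Visits every row once, resuming each page just after the last token seen. */
    static List<String> scanAll(NavigableMap<Long, String> table, int pageSize) {
        List<String> keys = new ArrayList<>();
        long nextToken = Long.MIN_VALUE; // smallest possible 64-bit token
        while (true) {
            List<Map.Entry<Long, String>> page = fetchPage(table, nextToken, pageSize);
            if (page.isEmpty()) {
                break; // nothing left at or after nextToken
            }
            for (Map.Entry<Long, String> e : page) {
                keys.add(e.getValue());
            }
            long lastToken = page.get(page.size() - 1).getKey();
            if (lastToken == Long.MAX_VALUE) {
                break; // lastToken + 1 would wrap around and restart the scan
            }
            nextToken = lastToken + 1;
        }
        return keys;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> table = new TreeMap<>();
        table.put(Long.MIN_VALUE, "a");
        table.put(-5L, "b");
        table.put(0L, "c");
        table.put(42L, "d");
        table.put(Long.MAX_VALUE, "e");
        System.out.println(scanAll(table, 2)); // prints [a, b, c, d, e]
    }
}
```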

I don't know, but it is as if the job had not completed the full scan, or my query had accessed only one node, or something like that. I don't know whether I'm doing something wrong or whether a full scan is really not possible.

Today I have almost 2 TB of data, in only one table, in one cluster of seven nodes.

Has anyone already been in this situation, or does anyone have a recommendation?

Recommended answer

It's definitely possible to do a full table scan in Cassandra; indeed, it's quite common for things like Spark. However, it's not typically "fast", so it's discouraged unless you know why you're doing it. For your actual questions:

1) If you're using CQL, you're almost certainly using the Murmur3 partitioner, so your minimum token is -9223372036854775808 (and your maximum token is 9223372036854775807).

2) You're using session.execute() with the default consistency level of ONE, which may not return all of the results in your cluster, especially if you're also writing at ONE, which I suspect you may be. Raise that to ALL, and use prepared statements to speed up the CQL parsing:

public void run() {
    Cluster cluster = getConnection();
    Session session = cluster.connect("db");
    LOGGER.info("Starting ...");
    // Start at the smallest Murmur3 token. actualToken and maxToken are fields
    // of the enclosing class (not shown); maxToken is assumed to be Long.MAX_VALUE.
    actualToken = Long.MIN_VALUE;
    boolean run = true;

    while (run) {
        if (maxTokenReached(actualToken)) {
            LOGGER.info("Max Token Reached!");
            break;
        }
        SimpleStatement stmt = new SimpleStatement(queryBuilder(actualToken));
        // Read from all replicas so that no row is missed.
        stmt.setConsistencyLevel(ConsistencyLevel.ALL);
        ResultSet resultSet = session.execute(stmt);

        Iterator<Row> rows = resultSet.iterator();
        if (!rows.hasNext()) {
            break;
        }

        List<String> rowIds = new ArrayList<String>();

        while (rows.hasNext()) {
            Row row = rows.next();

            Long leadTime = row.getLong("my_column");
            if (myCondition(leadTime)) {
                String rowId = row.getString("key");
                rowIds.add(rowId);
            }

            if (!rows.hasNext()) {
                // Column 0 is token(key); the next page resumes after this token.
                Long token = row.getLong(0);
                if (!rowIds.isEmpty()) {
                    LOGGER.info(String.format("Keys found! RowId's: %s ", rowIds));
                }
                actualToken = nextToken(token);
            }
        }
    }
    LOGGER.info("Done!");
    cluster.shutdown();
}

public boolean maxTokenReached(Long actualToken) {
    return actualToken >= maxToken;
}

public String queryBuilder(Long nextRange) {
    return String.format("select token(key), key, my_column from mytable where token(key) >= %s limit 10000;", nextRange.toString());
}

public Long nextToken(Long token) {
    // Guard against overflow: Long.MAX_VALUE + 1 would wrap to Long.MIN_VALUE and restart the scan.
    return token == Long.MAX_VALUE ? token : token + 1;
}
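The loop above walks the whole token ring sequentially from one client. One optional variation, shown here only as a sketch and not as part of the code above, is to cut the ring into independent subranges so that each one can be scanned with a bounded query. The class `TokenRanges` and the method `split` are hypothetical names. The sketch only computes the bounds, assuming the Murmur3 token space is the full signed 64-bit range; running the queries is left to your job.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Splits the Murmur3 token space into contiguous subranges for a chunked scan. */
class TokenRanges {
    // Assumption: Murmur3Partitioner tokens span the signed 64-bit range.
    static final long MIN_TOKEN = Long.MIN_VALUE;
    static final long MAX_TOKEN = Long.MAX_VALUE;

    /** Returns `parts` inclusive {start, end} pairs that cover every token exactly once. */
    static List<long[]> split(int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        BigInteger min = BigInteger.valueOf(MIN_TOKEN);
        // The number of tokens is 2^64, which does not fit in a long, hence BigInteger.
        BigInteger total = BigInteger.valueOf(MAX_TOKEN).subtract(min).add(BigInteger.ONE);
        BigInteger n = BigInteger.valueOf(parts);
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            // start_i = min + floor(total * i / parts); end_i = start_(i+1) - 1
            BigInteger start = min.add(total.multiply(BigInteger.valueOf(i)).divide(n));
            BigInteger nextStart = min.add(total.multiply(BigInteger.valueOf(i + 1L)).divide(n));
            BigInteger end = nextStart.subtract(BigInteger.ONE);
            ranges.add(new long[] { start.longValueExact(), end.longValueExact() });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Print the WHERE clause bounds for four subranges.
        for (long[] r : split(4)) {
            System.out.printf("token(key) >= %d AND token(key) <= %d%n", r[0], r[1]);
        }
    }
}
```

Each pair can then bound a query of the form `where token(key) >= start and token(key) <= end`, paged within the subrange in the same way as the loop above.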
