Read data from Cassandra for processing in Flink

Question

I have to process data streams from Kafka using Flink as the streaming engine. To analyze the data, I need to query some tables in Cassandra. What is the best way to do this? I have been looking for Scala examples of this but couldn't find any. How can data from Cassandra be read in Flink using Scala as the programming language? The question "Read & write data into cassandra using apache flink Java API" is along the same lines, and its answers mention multiple approaches. I would like to know which approach is best in my case. Also, most of the available examples are in Java; I am looking for Scala examples.

Recommended answer

I currently read from Cassandra using Async I/O in Flink 1.3. Here is the documentation on it:

https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html (where it has DatabaseClient, you will use com.datastax.driver.core.Cluster instead)

Let me know if you need a more in-depth example of using it to read from Cassandra specifically, but unfortunately I can only provide an example in Java.

Edit 1

Here is an example of the code I am using to read from Cassandra with Flink's Async I/O. I am still working on identifying and fixing an issue where, when a single query returns a large amount of data, the async data stream's timeout is triggered even though Cassandra appears to return the data fine and well before the timeout. But assuming that is just a bug in other things I am doing and not in this code, this should work fine for you (it has worked fine for me for months):

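import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import java.util.Collections;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

// Properties and CustomInputObject are the job's own classes (not shown here).
// Properties must be serializable, because Flink ships this function to the workers.
public class GenericCassandraReader extends RichAsyncFunction<CustomInputObject, ResultSet> {

    private final Properties props;
    // Created in open() on the worker, so these are never serialized.
    private transient Cluster cluster;
    private transient Session client;

    public GenericCassandraReader(Properties props) {
        super();
        this.props = props;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // One Cluster and Session per parallel instance of this function.
        cluster = Cluster.builder()
                .addContactPoint(props.cassandraUrl)
                .withPort(props.cassandraPort)
                .build();
        client = cluster.connect(props.cassandraKeyspace);
    }

    @Override
    public void close() throws Exception {
        // Closing the Cluster also closes every Session created from it.
        if (cluster != null) {
            cluster.close();
        }
    }

    @Override
    public void asyncInvoke(final CustomInputObject customInputObject, final AsyncCollector<ResultSet> asyncCollector) throws Exception {

        // Table and column names are placeholders. The id is concatenated into the
        // query text, so it must come from a trusted source.
        String queryString = "select * from table where fieldToFilterBy='" + customInputObject.id() + "';";

        // executeAsync returns immediately; the callback below runs when the query finishes.
        ListenableFuture<ResultSet> resultSetFuture = client.executeAsync(queryString);

        Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {

            @Override
            public void onSuccess(ResultSet resultSet) {
                // Emit the whole ResultSet as a single output element.
                asyncCollector.collect(Collections.singleton(resultSet));
            }

            @Override
            public void onFailure(Throwable t) {
                // Hand the failure to Flink so the operator fails with this cause.
                asyncCollector.collect(t);
            }
        });
    }
}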

Again, sorry for the delay. I was hoping to have the bug resolved so I could be certain, but figured that at this point having some reference would be better than nothing.

Edit 2

So we finally determined that the issue isn't with the code but with the network throughput. A lot of bytes were trying to come through a pipe that isn't large enough to handle them, so responses started backing up. Some trickled in, but the time it took to receive the result of each query started climbing to 4 seconds, then 6, then 8, and so on (we could see this thanks to the DataStax Cassandra driver's QueryLogger).

TL;DR: the code is fine, just be aware that if you experience timeoutExceptions in Flink's asyncWaitOperator, it may be a network issue.
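
To make the failure mode above concrete, here is a minimal plain-Java sketch of how a result that does arrive, but arrives late, still trips a timeout on the waiting side. It uses only java.util.concurrent and is not Flink or Cassandra code; the class name SlowQueryTimeoutDemo, the methods simulatedQuery and completesWithin, and all the millisecond values are made up for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class: a stand-in for "async lookup + timeout", not Flink API.
class SlowQueryTimeoutDemo {

    // Simulates a query whose result needs latencyMs to arrive over the network.
    static CompletableFuture<String> simulatedQuery(long latencyMs) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(latencyMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "rows";
        });
    }

    // Returns true if the result arrived within timeoutMs, false if the wait timed out.
    static boolean completesWithin(long latencyMs, long timeoutMs) {
        try {
            simulatedQuery(latencyMs).get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The query itself did not fail; the caller simply stopped waiting.
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(completesWithin(10, 2000));  // fast response
        System.out.println(completesWithin(2000, 50));  // slow response, short timeout
    }
}
```

In the second call the simulated query would eventually return its rows, yet the caller reports a timeout, which matches what we saw: the database side looked healthy while the waiting side timed out.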

Edit 2.5

I also realized it might be worth mentioning that, because of the network latency issue, we ended up moving to a RichMapFunction that holds the data we were reading from Cassandra in state. So the job just keeps track of all the records that come through it, instead of having to read from the table each time a new record comes through to get everything that is in there.
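
The idea of keeping the records in state rather than re-reading the table can be sketched in plain Java as below. This is only an illustration of the bookkeeping, not our actual job and not Flink API: the class SeenRecordsByKey and its map method are hypothetical, and the HashMap stands in for what would be Flink-managed state inside a RichMapFunction.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a stateful map step; not Flink API.
class SeenRecordsByKey {

    // key -> every record seen so far for that key (in a real job this would be Flink state)
    private final Map<String, List<String>> seen = new HashMap<>();

    // Remembers the record, then returns a snapshot of all records seen so far for its key.
    List<String> map(String key, String record) {
        List<String> records = seen.computeIfAbsent(key, k -> new ArrayList<>());
        records.add(record);
        return Collections.unmodifiableList(new ArrayList<>(records));
    }

    public static void main(String[] args) {
        SeenRecordsByKey step = new SeenRecordsByKey();
        step.map("user-1", "a");
        step.map("user-2", "b");
        System.out.println(step.map("user-1", "c")); // prints [a, c]
    }
}
```

Each incoming record is answered from memory, so no query crosses the network per record. The trade-off is that the state grows with the data, and a plain in-memory map like this one would be lost on restart unless it is kept in state that Flink checkpoints.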
