Apache Spark Kinesis Integration: connected, but no records received


Problem description

tldr; Can't use Kinesis Spark Streaming integration, because it receives no data.

  1. Testing stream is set up, nodejs app sends 1 simple record per second.
  2. Standard Spark 1.5.2 cluster is set up with master and worker nodes (4 cores) with docker-compose, AWS credentials in environment
  3. spark-streaming-kinesis-asl-assembly_2.10-1.5.2.jar is downloaded and added to classpath
  4. job.py or job.jar (just reads and prints) submitted.
  5. Everything seems to be okay, but no records whatsoever are received.

From time to time the KCL worker thread says "Sleeping ..." - it might be failing silently (I checked all the stderr I could find, but no hints). Maybe a swallowed OutOfMemoryError... but I doubt it, given the volume of 1 record per second.



    -------------------------------------------
    Time: 1448645109000 ms
    -------------------------------------------

    15/11/27 17:25:09 INFO JobScheduler: Finished job streaming job 1448645109000 ms.0 from job set of time 1448645109000 ms
    15/11/27 17:25:09 INFO KinesisBackedBlockRDD: Removing RDD 102 from persistence list
    15/11/27 17:25:09 INFO JobScheduler: Total delay: 0.002 s for time 1448645109000 ms (execution: 0.001 s)
    15/11/27 17:25:09 INFO BlockManager: Removing RDD 102
    15/11/27 17:25:09 INFO KinesisInputDStream: Removing blocks of RDD KinesisBackedBlockRDD[102] at createStream at NewClass.java:25 of time 1448645109000 ms
    15/11/27 17:25:09 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer(1448645107000 ms)
    15/11/27 17:25:09 INFO InputInfoTracker: remove old batch metadata: 1448645107000 ms
    15/11/27 17:25:10 INFO JobScheduler: Added jobs for time 1448645110000 ms
    15/11/27 17:25:10 INFO JobScheduler: Starting job streaming job 1448645110000 ms.0 from job set of time 1448645110000 ms
    -------------------------------------------
    Time: 1448645110000 ms
    -------------------------------------------
          <----- Some data expected to show up here!
    15/11/27 17:25:10 INFO JobScheduler: Finished job streaming job 1448645110000 ms.0 from job set of time 1448645110000 ms
    15/11/27 17:25:10 INFO JobScheduler: Total delay: 0.003 s for time 1448645110000 ms (execution: 0.001 s)
    15/11/27 17:25:10 INFO KinesisBackedBlockRDD: Removing RDD 103 from persistence list
    15/11/27 17:25:10 INFO KinesisInputDStream: Removing blocks of RDD KinesisBackedBlockRDD[103] at createStream at NewClass.java:25 of time 1448645110000 ms
    15/11/27 17:25:10 INFO BlockManager: Removing RDD 103
    15/11/27 17:25:10 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer(1448645108000 ms)
    15/11/27 17:25:10 INFO InputInfoTracker: remove old batch metadata: 1448645108000 ms
    15/11/27 17:25:11 INFO JobScheduler: Added jobs for time 1448645111000 ms
    15/11/27 17:25:11 INFO JobScheduler: Starting job streaming job 1448645111000 ms.0 from job set of time 1448645111000 ms

Please let me know any hints, I'd really like to use Spark for real time analytics... everything but this small detail of not receiving data :) seems to be ok.

PS: I find it strange that Spark somehow ignores my settings for storage level (MEMORY_AND_DISK_2) and checkpoint interval (20,000 ms):



    15/11/27 17:23:26 INFO KinesisInputDStream: metadataCleanupDelay = -1
    15/11/27 17:23:26 INFO KinesisInputDStream: Slide time = 1000 ms
    15/11/27 17:23:26 INFO KinesisInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
    15/11/27 17:23:26 INFO KinesisInputDStream: Checkpoint interval = null
    15/11/27 17:23:26 INFO KinesisInputDStream: Remember duration = 1000 ms
    15/11/27 17:23:26 INFO KinesisInputDStream: Initialized and validated org.apache.spark.streaming.kinesis.KinesisInputDStream@74b21a6

Source code (Java):



    import org.apache.spark.SparkConf;
    import org.apache.spark.storage.StorageLevel;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kinesis.KinesisUtils;
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;

    public class NewClass {

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("appname").setMaster("local[3]");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));
            JavaReceiverInputDStream<byte[]> kinesisStream = KinesisUtils.createStream(
                    ssc, "webassist-test", "test", "https://kinesis.us-west-1.amazonaws.com", "us-west-1",
                    InitialPositionInStream.LATEST,
                    new Duration(20000),
                    StorageLevel.MEMORY_AND_DISK_2()
            );
            kinesisStream.print();
            ssc.start();
            ssc.awaitTermination();
        }
    }

Python code (tried both pprinting before and sending to MongoDB):



    from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
    from pyspark import SparkContext, StorageLevel
    from pyspark.streaming import StreamingContext
    from sys import argv

    sc = SparkContext(appName="webassist-test")
    ssc = StreamingContext(sc, 5)

    stream = KinesisUtils.createStream(ssc,
         "appname",
         "test",
         "https://kinesis.us-west-1.amazonaws.com",
         "us-west-1",
         InitialPositionInStream.LATEST,
         5,
         StorageLevel.MEMORY_AND_DISK_2)

    stream.pprint()
    ssc.start()
    ssc.awaitTermination()

Note: I also tried sending data to MongoDB with stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition)) but not pasting it here, since you'd need a MongoDB instance and it's not related to the problem - no records come in on the input already.

One more thing - the KCL never commits. The corresponding DynamoDB looks like this:


    leaseKey              checkpoint  leaseCounter  leaseOwner            ownerSwitchesSinceCheckpoint
    shardId-000000000000  LATEST      614           localhost:d92516...   8
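
In the KCL lease table, the checkpoint column normally holds a real sequence number once the worker has checkpointed; it keeps the sentinel initial-position value until then. A quick way to read a row like the one above (a hypothetical helper, not part of the KCL API):

```python
# Hypothetical helper (not part of the KCL API): the 'checkpoint' column
# stays at the sentinel initial position ("LATEST" or "TRIM_HORIZON")
# until the worker checkpoints a real sequence number.
def lease_has_progress(checkpoint_value):
    return checkpoint_value not in ("LATEST", "TRIM_HORIZON")

row = {"leaseKey": "shardId-000000000000", "checkpoint": "LATEST"}
print(lease_has_progress(row["checkpoint"]))  # False: no record ever checkpointed
```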

Command used to submit:

    spark-submit --executor-memory 1024m --master spark://IpAddress:7077 /path/test.py

In the MasterUI I can see:

 Input Rate
   Receivers: 1 / 1 active
   Avg: 0.00 events/sec
 KinesisReceiver-0
   Avg: 0.00 events/sec
...
 Completed Batches (last 76 out of 76)

Thanks for any help!

Answer

I've had issues with no record activity being shown in Spark Streaming in the past when connecting with Kinesis.

I'd try these things to get more feedback/a different behaviour from Spark:

  1. Make sure that you force the evaluation of your DStream transformation operations with output operations like foreachRDD, print, saveAs...
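
The laziness this point guards against can be illustrated without Spark at all; the sketch below uses plain Python generators as a stand-in for DStream transformations (an analogy, not Spark code):

```python
collected = []

def transform(stream):
    # like DStream.map/filter: only *defines* work, executes nothing
    return (x * 2 for x in stream)

def output_action(stream):
    # like foreachRDD/pprint: an output operation that forces evaluation
    for x in stream:
        collected.append(x)

pipeline = transform(iter([1, 2, 3]))
# at this point 'collected' is still empty: no output operation has run yet
output_action(pipeline)
print(collected)  # [2, 4, 6]
```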

Create a new KCL application in DynamoDB by using a new name for the "Kinesis app name" parameter when creating the stream, or purge the existing DynamoDB table.

Switch between TRIM_HORIZON and LATEST for initial position when creating the stream.

Restart the context when you try these changes.

EDIT after code was added: Perhaps I'm missing something obvious, but I cannot spot anything wrong with your source code. Do you have n+1 CPUs running this application (n being the number of Kinesis shards)?
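
That core check can be made concrete; each receiver-based input stream pins one core full-time, so a rough rule of thumb (a sketch; the exact scheduling is Spark's) is:

```python
def enough_cores(total_cores, num_receiver_streams):
    # Each receiver-based input stream occupies one core; at least one
    # more core must remain free to actually process the received batches.
    return total_cores >= num_receiver_streams + 1

# e.g. master local[3] with one Kinesis receiver:
print(enough_cores(3, 1))  # True
print(enough_cores(1, 1))  # False: the receiver starves the processing
```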

If you run a KCL application (Java/Python/...) reading from the shards in your docker instance, does it work? Perhaps there's something wrong with your network configuration, but I'd expect some error messages pointing it out.

If this is important enough / you have a bit of time, you can quickly implement a KCL reader in your Docker instance, which will let you compare with your Spark application. Some URLs:

Python

Java

Python example
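
As a minimal stand-in for a full KCL reader, a plain polling loop over the Kinesis data API can confirm whether records reach the shard at all. The sketch below accepts any client exposing boto-style `get_shard_iterator`/`get_records` calls; a fake in-memory client is used here so the logic is self-contained, but a real `boto3.client('kinesis')` can be swapped in:

```python
def drain_shard(client, stream_name, shard_id,
                iterator_type="TRIM_HORIZON", max_batches=5):
    """Poll one shard and return the raw record payloads seen."""
    it = client.get_shard_iterator(
        StreamName=stream_name, ShardId=shard_id,
        ShardIteratorType=iterator_type)["ShardIterator"]
    payloads = []
    for _ in range(max_batches):
        resp = client.get_records(ShardIterator=it, Limit=100)
        payloads.extend(rec["Data"] for rec in resp["Records"])
        it = resp.get("NextShardIterator")
        if not it:
            break
    return payloads

# Fake in-memory client standing in for boto3, just to exercise the loop:
class FakeKinesis:
    def __init__(self, batches):
        self.batches = list(batches)
    def get_shard_iterator(self, **kwargs):
        return {"ShardIterator": "it-0"}
    def get_records(self, ShardIterator, Limit):
        batch = self.batches.pop(0) if self.batches else []
        nxt = "it-next" if self.batches else None
        return {"Records": [{"Data": d} for d in batch],
                "NextShardIterator": nxt}

records = drain_shard(FakeKinesis([[b"a", b"b"], [b"c"]]),
                      "webassist-test", "shardId-000000000000")
print(records)  # [b'a', b'b', b'c']
```

If this loop returns payloads against the real stream while the Spark job shows none, the problem sits on the Spark/KCL side rather than in the network path or the producer.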

Another option is to run your Spark Streaming application in a different cluster and to compare.

P.S.: I'm currently using Spark Streaming 1.5.2 with Kinesis in different clusters and it processes records / shows activity as expected.
