Running multiple Spark Kafka Structured Streaming queries in the same Spark session increases the offsets but shows numInputRows 0

Problem description

I have a Spark Structured Streaming job consuming records from a Kafka topic with 2 partitions.

Spark job: 2 queries, each consuming from one of the 2 partitions, running in the same Spark session.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.sql.functions.{col, from_json}

    // Query 1: assigned to partition 0 of the topic.
    val df1 = session.readStream.format("kafka")
            .option("kafka.bootstrap.servers", kafkaBootstrapServer)
            .option("assign", "{\"multi-stream1\" : [0]}")
            .option("startingOffsets", latest)
            // Note: the Spark Kafka source always returns keys/values as binary,
            // so these consumer deserializer options have no effect here.
            .option("key.deserializer", classOf[StringDeserializer].getName)
            .option("value.deserializer", classOf[StringDeserializer].getName)
            .option("max.poll.records", 500)
            .option("failOnDataLoss", true)
            .load()
    val query1 = df1
            .select(col("key").cast("string"), from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
            .select("key", "data.*")
            // Both queries write to the same base path -- see the answer below.
            .writeStream.format("parquet").option("path", path).outputMode("append")
            .option("checkpointLocation", checkpoint_dir1)
            .partitionBy("key")/*.trigger(Trigger.ProcessingTime("5 seconds"))*/
            .queryName("query1").start()

    // Query 2: assigned to partition 1, with its own checkpoint directory.
    val df2 = session.readStream.format("kafka")
            .option("kafka.bootstrap.servers", kafkaBootstrapServer)
            .option("assign", "{\"multi-stream1\" : [1]}")
            .option("startingOffsets", latest)
            .option("key.deserializer", classOf[StringDeserializer].getName)
            .option("value.deserializer", classOf[StringDeserializer].getName)
            .option("max.poll.records", 500)
            .option("failOnDataLoss", true)
            .load()
    val query2 = df2
            .select(col("key").cast("string"), from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
            .select("key", "data.*")
            .writeStream.format("parquet").option("path", path).outputMode("append")
            .option("checkpointLocation", checkpoint_dir2)
            .partitionBy("key")/*.trigger(Trigger.ProcessingTime("5 seconds"))*/
            .queryName("query2").start()

    session.streams.awaitAnyTermination()

Problem: every time records are pushed to both partitions, both queries show progress, but only one of them emits output. I only see output from the query whose records were actually processed. For example: records are pushed to Kafka partition 0, and Spark processes query1. Records are pushed to Kafka partition 1 while query1 is busy processing; Spark shows the start and end offsets incrementing, but numInputRows = 0 for query2.

Running environment: Local PC - same problem. Dataproc cluster - same problem, submitted as:

    spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.5 \
      --class org.DifferentPartitionSparkStreaming --master yarn --deploy-mode cluster \
      --num-executors 2 --driver-memory 4g --executor-cores 4 --executor-memory 4g \
      gs://dpl-ingestion-event/jars/stream_consumer-jar-with-dependencies.jar \
      "{\"multiple-streaming\" : [0]}" latest "10.w.x.y:9092,10.r.s.t:9092,10.a.b.c:9092" \
      "{\"multiple-streaming\" : [1]}"

The checkpoint and output paths are in a Google Cloud Storage bucket.

Logs

20/07/24 19:37:27 INFO MicroBatchExecution: Streaming query made progress: {
  "id" : "e7d026f7-bf62-4a86-8697-a95a2fc893bb",
  "runId" : "21169889-6e4b-419d-b338-2d4d61999f5b",
  "name" : "reconcile",
  "timestamp" : "2020-07-24T14:06:55.002Z",
  "batchId" : 2,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "addBatch" : 3549,
    "getBatch" : 0,
    "getEndOffset" : 1,
    "queryPlanning" : 32,
    "setOffsetRange" : 1,
    "triggerExecution" : 32618,
    "walCommit" : 15821
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Assign[multi-stream1-1]]",
    "startOffset" : {
      "multi-stream1" : {
        "1" : 240
      }
    },
    "endOffset" : {
      "multi-stream1" : {
        "1" : 250
      }
    },
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "FileSink[gs://dpl-ingestion-event/demo/test/single-partition/data]"
  }
}

Answer

I was able to resolve the problem. The root cause was that both queries were trying to write to the same base path, so their _spark_metadata information overlapped. Spark Structured Streaming maintains checkpointing, as well as a _spark_metadata file, to keep track of the batches being processed.

From the Spark documentation:

In order to correctly handle partial failures while maintaining exactly-once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet-based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.
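
Concretely, with both queries pointed at the same output path, the sink layout collides roughly like this (a simplified illustration; the entry names are schematic, not taken from the question):

    path/
    ├── _spark_metadata/          <- one shared metadata log for two queries
    │   ├── 0                     <- batch 0 commit written by query1
    │   └── ...                   <- query2 also logs its batches 0, 1, ... here
    ├── key=a/part-....parquet
    └── key=b/part-....parquet

Each query numbers its batches independently from 0, so a query can find commit entries written by the other and treat those batches as already done, which is consistent with the offsets advancing while numInputRows stays 0.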

Thus, for now, every query should be given a separate output path, as in the sketch below. Unlike the checkpoint location, there is no option to configure the _spark_metadata location.
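
A minimal sketch of the fix, reusing df1, df2, and schema from the question above; the GCS paths are hypothetical placeholders:

    import org.apache.spark.sql.functions.{col, from_json}

    // Hypothetical distinct base paths, one per query, so that each FileSink
    // keeps its own _spark_metadata log (the checkpoints were already separate).
    val outputPath1 = "gs://my-bucket/multi-stream/partition-0/data"
    val outputPath2 = "gs://my-bucket/multi-stream/partition-1/data"

    val query1 = df1
            .select(col("key").cast("string"), from_json(col("value").cast("string"), schema, Map.empty[String, String]).as("data"))
            .select("key", "data.*")
            .writeStream.format("parquet")
            .option("path", outputPath1)                   // unique output path per query
            .option("checkpointLocation", checkpoint_dir1)
            .partitionBy("key")
            .queryName("query1").start()

    // query2 is built the same way from df2, writing to outputPath2 with
    // checkpoint_dir2; both are then awaited via session.streams.awaitAnyTermination().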

