Spark Structured Streaming File Source Starting Offset


Problem Description

Is there a way to specify a starting offset for the Spark Structured Streaming file source?

I am trying to stream Parquet files from HDFS:

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

spark.sql("SET spark.sql.streaming.schemaInference=true")

spark.readStream
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()

As I can see, the first run processes all files detected in the path, then saves their offsets to the checkpoint location and afterwards processes only new files, that is, files that pass the age check and are not already present in the seen-files map.

I'm looking for a way to specify a starting offset, a timestamp, or a number-of-files option so that the first run does not process all available files.

Is there an option that does what I'm looking for?

Answer

Thanks to @jayfah. As far as I can tell, we can simulate Kafka's 'latest' starting offset with the following trick:

  1. Run a warm-up stream with option("latestFirst", true) and option("maxFilesPerTrigger", "1"), with a checkpoint, a dummy sink, and a huge processing-time trigger interval. This way, the warm-up stream saves the latest file timestamp to the checkpoint.
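The warm-up step could be sketched as follows. This assumes Spark 3.x, where the built-in "noop" format can serve as the dummy sink (on older versions a memory sink would do); the paths, trigger interval, and the stop logic are illustrative, not part of the original answer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

val spark = SparkSession.builder.getOrCreate()

// Warm-up: read the newest file first, one file per micro-batch, with a
// huge trigger interval, so only the latest file's timestamp is committed
// to the checkpoint before we stop the query.
val warmUp = spark.readStream
  .option("latestFirst", "true")      // process newest files first
  .option("maxFilesPerTrigger", "1")  // one file per micro-batch
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("noop")                     // dummy sink (Spark 3.x)
  .trigger(Trigger.ProcessingTime(1.hours))
  .start()

// Wait for the first micro-batch to commit, then stop before the
// next trigger fires.
while (warmUp.recentProgress.isEmpty) Thread.sleep(500)
warmUp.stop()
```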

  2. Run the real stream with option("maxFileAge", "0") and the real sink, using the same checkpoint location. In this case the stream will process only newly available files.
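The real stream then reuses the same checkpoint location; with maxFileAge set to 0, files older than the newest timestamp already recorded in the seen-files map are ignored. A sketch, with the same illustrative paths as above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

val spark = SparkSession.builder.getOrCreate()

val real = spark.readStream
  .option("maxFileAge", "0")  // skip files older than the newest one seen
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")  // same checkpoint as the warm-up
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()
```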

Most probably this is not something you'd need in production, and there are better ways, e.g. reorganizing the data paths, but at least this answered my question.

