Spark Structured Streaming File Source Starting Offset


Problem Description

Is there a way to specify the starting offset for the Spark Structured Streaming file source?

I am trying to stream Parquet files from HDFS:

spark.sql("SET spark.sql.streaming.schemaInference=true")

spark.readStream
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()

As I see it, the first run processes all files detected in the path, then saves the offsets to the checkpoint location and afterwards processes only new files, i.e. files that pass the age check and are not yet present in the seen-files map.

I'm looking for a way to specify a starting offset or timestamp, or some option, so that not all of the available files are processed on the first run.

Is there a way to do this?

Recommended Answer

Thanks @jayfah. As far as I found, we can simulate Kafka's 'latest' starting offset with the following trick:

  1. Run a warm-up stream with option("latestFirst", true) and option("maxFilesPerTrigger", "1"), a checkpoint location, a dummy sink, and a huge processing-time trigger interval. This way, the warm-up stream saves the latest file timestamp to the checkpoint.

  2. Run the real stream with option("maxFileAge", "0") and the real sink, using the same checkpoint location. In this case the stream will process only newly available files; see the sketch after this list.
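
Putting the two phases together: the following is a minimal sketch, assuming Spark 3.x (whose built-in noop sink can play the dummy-sink role; on 2.x a memory or foreach sink would do) and reusing the paths from the question. The warm-up wait logic and the warmUp name are illustrative, not part of the original answer.

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

// Phase 1: warm-up. latestFirst=true with maxFilesPerTrigger=1 makes the
// source read just the newest file; the noop sink discards the rows. The
// huge trigger interval keeps the query idle after the first micro-batch,
// whose commit records the latest file timestamp in the checkpoint.
val warmUp = spark.readStream
  .option("latestFirst", "true")
  .option("maxFilesPerTrigger", "1")
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("noop")
  .trigger(Trigger.ProcessingTime(1.hour))
  .start()

// Wait for the first micro-batch to commit, then stop the warm-up query.
// A crude poll is enough for a sketch; production code would be stricter.
while (warmUp.lastProgress == null) Thread.sleep(1000)
warmUp.stop()

// Phase 2: the real stream over the same checkpoint. maxFileAge=0 keeps the
// source from picking up files older than the latest one already recorded,
// so only newly arriving files are processed.
spark.readStream
  .option("maxFileAge", "0")
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()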

Most probably this is not something you'd want in production, and there are better ways, e.g. reorganizing the data paths, but at least this is the answer I found to my question.

