Spark Structured Streaming File Source Starting Offset


Problem Description

Is there a way to specify a starting offset for the Spark Structured Streaming file source?

I am trying to stream Parquet files from HDFS:

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

spark.sql("SET spark.sql.streaming.schemaInference=true")

spark.readStream
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()

As I can see, the first run processes all files detected in the path, then saves their offsets to the checkpoint location and afterwards processes only new files, that is, files that pass the age check and are not already present in the seen-files map.

I'm looking for a way to specify a starting offset, a timestamp, or a number-of-files option so that the first run does not process all available files.

Is there an option that does what I'm looking for?

Answer

Thanks to @jayfah. As far as I can tell, we can simulate Kafka's 'latest' starting offset with the following trick:

  1. Run a warm-up stream with option("latestFirst", true) and option("maxFilesPerTrigger", "1"), with a checkpoint, a dummy sink, and a huge processing-time trigger interval. This way, the warm-up stream saves the latest file timestamp to the checkpoint.
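The warm-up step could be sketched as follows. This assumes Spark 3.x, where the built-in "noop" format can serve as the dummy sink (on older versions a memory sink would do); the paths, trigger interval, and the stop logic are illustrative, not part of the original answer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

val spark = SparkSession.builder.getOrCreate()

// Warm-up: read the newest file first, one file per micro-batch, with a
// huge trigger interval, so only the latest file's timestamp is committed
// to the checkpoint before we stop the query.
val warmUp = spark.readStream
  .option("latestFirst", "true")      // process newest files first
  .option("maxFilesPerTrigger", "1")  // one file per micro-batch
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("noop")                     // dummy sink (Spark 3.x)
  .trigger(Trigger.ProcessingTime(1.hours))
  .start()

// Wait for the first micro-batch to commit, then stop before the
// next trigger fires.
while (warmUp.recentProgress.isEmpty) Thread.sleep(500)
warmUp.stop()
```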

  2. Run the real stream with option("maxFileAge", "0") and the real sink, using the same checkpoint location. In this case the stream will process only newly available files.
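The real stream then reuses the same checkpoint location; with maxFileAge set to 0, files older than the newest timestamp already recorded in the seen-files map are ignored. A sketch, with the same illustrative paths as above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

val spark = SparkSession.builder.getOrCreate()

val real = spark.readStream
  .option("maxFileAge", "0")  // skip files older than the newest one seen
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")  // same checkpoint as the warm-up
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()
```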

Most probably this is not something you'd need in production, and there are better ways, e.g. reorganizing the data paths, but at least this answered my question.

