How to load tar.gz files in streaming datasets?
Question
I would like to stream from tar-gzip (tgz) files which contain my actual CSV data.
I already managed to do structured streaming with Spark 2.2 when my data comes in as plain CSV files, but in fact the data arrives as gzipped CSV files.
Is there a way for the trigger fired by Structured Streaming to decompress the archive before handling the CSV stream?
The code I use to process the files is this:
import org.apache.spark.sql.Encoders

val schema = Encoders.product[RawData].schema
val trackerData = spark
  .readStream
  .option("delimiter", "\t")
  .schema(schema)
  .csv(path)

val exceptions = trackerData
  .as[String]
  .flatMap(extractExceptions)
  .as[ExceptionData]
produced the expected output when path points to CSV files. But I would like to use tar-gzip files. When I place those files at the given path, I do not get any exceptions, and the batch output tells me:
"sources" : [ {
  "description" : "FileStreamSource[file:/Users/matthias/spark/simple_spark/src/main/resources/zsessionlog*]",
  "startOffset" : null,
  "endOffset" : {
    "logOffset" : 0
  },
  "numInputRows" : 1095,
  "processedRowsPerSecond" : 211.0233185584891
} ],
But no actual data gets processed. The console sink looks like this:
+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+
Answer
I do not think reading tar.gz'ed files is possible in Spark (see Read whole text files from a compression in Spark or gzip support in Spark for some ideas).
Spark does support gzip files, but they are not recommended since they are not splittable and result in a single partition (which in turn makes Spark of little to no help).
In order to have gzipped files loaded in Spark Structured Streaming, you have to specify the path pattern so that the files are included in loading, e.g. zsessionlog*.csv.gz or alike. Otherwise, csv alone loads plain CSV files only.
If you insist on using Spark Structured Streaming to handle tar.gz'ed files, you could write a custom streaming data Source to do the un-tar.gz step.
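A simpler workaround than a custom Source is to un-tar the archives outside of Spark, so that the extracted CSVs (which Spark can read) land in the directory the streaming query watches. A minimal sketch, with illustrative directory names:

```shell
# Illustrative layout: archives arrive in incoming/, the streaming
# query watches watched/. Extract each tar.gz so the plain CSV files
# land in the watched directory.
mkdir -p incoming watched
for archive in incoming/*.tgz; do
  [ -e "$archive" ] || continue   # skip when the glob matches nothing
  tar -xzf "$archive" -C watched
done
```

In production this step would typically run as a small cron job or as part of the ingestion pipeline, before Spark ever sees the files.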
Given that gzip files are not recommended as a data format in Spark, the whole idea of using Spark Structured Streaming on them does not make much sense.