How do I read in parquet files using `ssc.fileStream()`, and what is the nature of the types passed to `ssc.fileStream()`?

Question

My understanding of Spark's fileStream() method is that it takes three types as parameters - Key, Value, and Format. In the case of text files, the appropriate types are LongWritable, Text, and TextInputFormat. I firstly want to understand the nature of these types. Intuitively, I would guess that the Key in this case is the line number of the file, and the Value is the text on that line. So in the following example of a text file:

Hello
Test
Another Test

The first row of the DStream would have a Key of 1 (or 0?) and a Value of Hello.

Is this correct?
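
For reference, with TextInputFormat the LongWritable key is the byte offset at which each line starts in the file, not a line number, and the Text value is the line's content. The following is a minimal plain-Scala sketch of the keys the sample file above would produce. It does not use Hadoop at all; `LineOffsets` and `withByteOffsets` are hypothetical names, and UTF-8 text with a single `\n` after every line is assumed.

```scala
import java.nio.charset.StandardCharsets

object LineOffsets {
  // Pair each line with the byte offset at which it starts, assuming UTF-8
  // text and exactly one '\n' terminator after every line (an assumption).
  def withByteOffsets(lines: Seq[String]): Seq[(Long, String)] = {
    // Size of each line in bytes, including its trailing newline.
    val sizes = lines.map(_.getBytes(StandardCharsets.UTF_8).length.toLong + 1L)
    // Running totals give the start offset of each line; drop the final total.
    val starts = sizes.scanLeft(0L)(_ + _).init
    starts.zip(lines)
  }

  def main(args: Array[String]): Unit =
    withByteOffsets(Seq("Hello", "Test", "Another Test")).foreach(println)
}
```

Under these assumptions the three lines are keyed 0, 6, and 11, so the first record would be (0, Hello).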

The next part of my question: I looked at the decompiled implementation of ParquetInputFormat and I noticed something curious:

TextInputFormat extends FileInputFormat of types LongWritable and Text, whereas ParquetInputFormat extends the same class of types Void and T.

Does this mean that I must create a Value class to hold an entire row of my parquet data, and then pass the types [Void, MyClass, ParquetInputFormat[MyClass]] to ssc.fileStream()?

If so, how should I implement MyClass?

Any other guidance is greatly welcomed.

EDIT: I have noticed a readSupportClass which is to be passed to ParquetInputFormat objects. What kind of class is this, and how is it used to parse the parquet file? Is it something I should know and understand?

As an aside - is there some documentation that covers this? I couldn't find any.

EDIT 2: As far as I can tell, this is impossible. If anybody knows how to stream in parquet files to Spark then please feel free to share.

Recommended answer

My sample to read parquet files in Spark Streaming is below.

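import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.spark.streaming.{Seconds, StreamingContext}
import parquet.hadoop.ParquetInputFormat

// sparkConf (a SparkConf) and directory (a String path) are assumed to be defined elsewhere.
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Tell Parquet to materialize each row as an Avro record.
ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class", "parquet.avro.AvroReadSupport")
val stream = ssc.fileStream[Void, GenericRecord, ParquetInputFormat[GenericRecord]](
  directory, { path: Path => path.toString.endsWith("parquet") }, true, ssc.sparkContext.hadoopConfiguration)

// Each element of the stream is a (Void, GenericRecord) pair.
val lines = stream.map(row => {
  println("row:" + row.toString())
  row
})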

Some points are ...


  • The record type is GenericRecord

  • The readSupportClass is AvroReadSupport

  • Pass the Configuration to fileStream

  • Set parquet.read.support.class in the Configuration
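
Putting those points together, here is a hedged sketch of a complete driver that keeps only the record from each (Void, GenericRecord) pair and reads one column from it. It is an illustration under assumptions, not a tested implementation: the object name `ParquetStreamSketch`, the column name "name", and taking the watched directory from `args(0)` are all hypothetical, and it presumes the same Spark Streaming, Avro, and pre-Apache `parquet.*` packages as the sample above are on the classpath.

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import parquet.hadoop.ParquetInputFormat

object ParquetStreamSketch {
  def main(args: Array[String]): Unit = {
    val directory = args(0) // hypothetical: directory watched for new parquet files
    val sparkConf = new SparkConf().setAppName("ParquetStreamSketch")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Set the read support class on the Configuration that is passed to fileStream.
    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.set("parquet.read.support.class", "parquet.avro.AvroReadSupport")

    val stream = ssc.fileStream[Void, GenericRecord, ParquetInputFormat[GenericRecord]](
      directory, (path: Path) => path.toString.endsWith("parquet"), true, hadoopConf)

    // The Void key carries no information, so keep only the record and
    // convert the (hypothetical) "name" column to a String right away.
    val names = stream.map { case (_, record) => String.valueOf(record.get("name")) }
    names.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Converting each record to a plain String inside the first map is a deliberate choice in this sketch, so that later stages only handle ordinary serializable values rather than Avro records.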

I referred to the source code below when creating this sample.
I could not find any good examples either.
I would welcome a better answer.

https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala

https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java

https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala