如何使用`ssc.fileStream()`读取实木复合地板文件?传递给`ssc.fileStream()`的类型是什么? [英] How to read parquet files using `ssc.fileStream()`? What are the types passed to `ssc.fileStream()`?

查看:156
本文介绍了如何使用`ssc.fileStream()`读取实木复合地板文件?传递给`ssc.fileStream()`的类型是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对Spark的fileStream()方法的理解是,它采用三种类型作为参数:KeyValueFormat.对于文本文件,适当的类型为:LongWritableTextTextInputFormat.

My understanding of Spark's fileStream() method is that it takes three types as parameters: Key, Value, and Format. In case of text files, the appropriate types are: LongWritable, Text, and TextInputFormat.

首先,我想了解这些类型的性质.凭直觉,我猜想在这种情况下,Key是文件的行号,而Value是该行上的文本.因此,在以下文本文件示例中:

First, I want to understand the nature of these types. Intuitively, I would guess that the Key in this case is the line number of the file, and the Value is the text on that line. So, in the following example of a text file:

Hello
Test
Another Test

DStream的第一行的Key1(0?),而ValueHello.

The first row of the DStream would have a Key of 1 (0?) and a Value of Hello.

这正确吗?


问题的第二部分:我查看了ParquetInputFormat的反编译实现,发现有些奇怪:


Second part of my question: I looked at the decompiled implementation of ParquetInputFormat and I noticed something curious:

public class ParquetInputFormat<T>
       extends FileInputFormat<Void, T> {
//...

public class TextInputFormat
       extends FileInputFormat<LongWritable, Text>
       implements JobConfigurable {
//...

TextInputFormat扩展了类型LongWritableTextFileInputFormat,而ParquetInputFormat扩展了类型VoidT的同一类.

TextInputFormat extends FileInputFormat of types LongWritable and Text, whereas ParquetInputFormat extends the same class of types Void and T.

这是否意味着我必须创建一个Value类来容纳整个镶木地板数据行,然后将类型<Void, MyClass, ParquetInputFormat<MyClass>>传递给ssc.fileStream()?

Does this mean that I must create a Value class to hold an entire row of my parquet data, and then pass the types <Void, MyClass, ParquetInputFormat<MyClass>> to ssc.fileStream()?

如果是,我应该如何实现MyClass?

If so, how should I implement MyClass?


编辑1 :我注意到了要传递给ParquetInputFormat对象的readSupportClass.这是什么样的类,如何将其用于解析镶木地板文件?有一些文档可以解决这个问题吗?


EDIT 1: I have noticed a readSupportClass which is to be passed to ParquetInputFormat objects. What kind of class is this, and how is it used to parse the parquet file? Is there some documentation that covers this?


编辑2 :据我所知,这是不可能.如果有人知道如何将镶木地板文件流式传输到Spark,请随时共享...


EDIT 2: As far as I can tell, this is impossible. If anybody knows how to stream in parquet files to Spark then please feel free to share...

推荐答案

下面是我在Spark Streaming中读取镶木地板文件的示例.

My sample to read parquet files in Spark Streaming is below.

val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class", "parquet.avro.AvroReadSupport")
val stream = ssc.fileStream[Void, GenericRecord, ParquetInputFormat[GenericRecord]](
  directory, { path: Path => path.toString.endsWith("parquet") }, true, ssc.sparkContext.hadoopConfiguration)

val lines = stream.map(row => {
  println("row:" + row.toString())
  row
})

有些要点是...

  • 记录类型为GenericRecord
  • readSupportClass是AvroReadSupport
  • 将配置传递给fileStream
  • 将parquet.read.support.class设置为配置

我参考下面的源代码来创建示例.
而且我也找不到很好的例子.
我想等一个更好的.

I referred to source codes below for creating sample.
And I also could not find good examples.
I would like to wait better one.

https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
https ://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala

https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala

这篇关于如何使用`ssc.fileStream()`读取实木复合地板文件?传递给`ssc.fileStream()`的类型是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆