How to read parquet files using `ssc.fileStream()`? What are the types passed to `ssc.fileStream()`?
Question
My understanding of Spark's `fileStream()` method is that it takes three types as parameters: `Key`, `Value`, and `Format`. In the case of text files, the appropriate types are `LongWritable`, `Text`, and `TextInputFormat`.
First, I want to understand the nature of these types. Intuitively, I would guess that the `Key` in this case is the line number of the file, and the `Value` is the text on that line. So, in the following example of a text file:
Hello
Test
Another Test
The first row of the `DStream` would have a `Key` of `1` (`0`?) and a `Value` of `Hello`.
Is this correct?
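(As a side note on the guess above: with Hadoop's `TextInputFormat` the key is the byte offset at which each line starts, not its line number. A pure-Scala sketch, no Spark needed, computing those offsets for the example file above:)

```scala
// TextInputFormat keys lines by byte offset, not line number.
// Compute the offsets for the example file by hand:
val contents = "Hello\nTest\nAnother Test\n"
val lines = contents.split("\n")

// Each line starts where the previous one ended, plus one byte for '\n'.
val offsets = lines
  .scanLeft(0L)((acc, line) => acc + line.getBytes("UTF-8").length + 1)
  .init

val keyed = offsets.zip(lines)
// keyed: Array((0,Hello), (6,Test), (11,Another Test))
```

So the first record's key would be `0`, a byte offset, and the key of `Test` would be `6`, not `2`.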
Second part of my question: I looked at the decompiled implementation of `ParquetInputFormat` and I noticed something curious:
public class ParquetInputFormat<T>
extends FileInputFormat<Void, T> {
//...
public class TextInputFormat
extends FileInputFormat<LongWritable, Text>
implements JobConfigurable {
//...
`TextInputFormat` extends `FileInputFormat` with types `LongWritable` and `Text`, whereas `ParquetInputFormat` extends the same class with types `Void` and `T`.
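To make the difference in type shape concrete, here is a toy sketch (stub classes of my own invention, not the real Hadoop types):

```scala
// Toy stand-ins for the real Hadoop classes, to show only the generic shapes.
class LongWritableStub
class TextStub

trait FileInputFormatShape[K, V]

// TextInputFormat fixes both type parameters: key = byte offset, value = line text.
class TextShape extends FileInputFormatShape[LongWritableStub, TextStub]

// ParquetInputFormat fixes only the key (Void, i.e. no meaningful key);
// the value type T -- the record type -- is left for the caller to choose.
class ParquetShape[T] extends FileInputFormatShape[Void, T]

val p = new ParquetShape[String]
```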
Does this mean that I must create a `Value` class to hold an entire row of my parquet data, and then pass the types `<Void, MyClass, ParquetInputFormat<MyClass>>` to `ssc.fileStream()`?
If so, how should I implement `MyClass`?
EDIT 1: I have noticed a `readSupportClass` which is to be passed to `ParquetInputFormat` objects. What kind of class is this, and how is it used to parse the parquet file? Is there some documentation that covers this?
EDIT 2: As far as I can tell, this is impossible. If anybody knows how to stream parquet files into Spark, please feel free to share...
Answer
My sample code for reading parquet files in Spark Streaming is below.
val ssc = new StreamingContext(sparkConf, Seconds(2))

// Tell ParquetInputFormat how to materialize records: use Avro's read support.
ssc.sparkContext.hadoopConfiguration.set(
  "parquet.read.support.class", "parquet.avro.AvroReadSupport")

// Key is Void (parquet records have no key); value is an Avro GenericRecord.
val stream = ssc.fileStream[Void, GenericRecord, ParquetInputFormat[GenericRecord]](
  directory,
  { path: Path => path.toString.endsWith("parquet") }, // only pick up parquet files
  true,                                                // newFilesOnly
  ssc.sparkContext.hadoopConfiguration)

val lines = stream.map { row =>
  println("row:" + row.toString())
  row
}
Some important points are:
- The record type is GenericRecord
- The readSupportClass is AvroReadSupport
- Pass the Configuration to fileStream
- Set parquet.read.support.class in the Configuration
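The path filter argument in the sample is just a `Path => Boolean` predicate. Written as a plain function (the name `isParquetFile` is mine), the same logic is:

```scala
// The same file-name check used inline in the sample,
// extracted as a plain predicate on the path string.
def isParquetFile(pathString: String): Boolean =
  pathString.endsWith("parquet")
```

Note this matches on the suffix only, so e.g. `part-00000.snappy.parquet` passes, while marker files like `_SUCCESS` are skipped.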
I referred to the source code below when creating this sample. I also could not find good examples, so I would welcome a better one.
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala