Apache Spark Structured Streaming (DataStreamWriter) write to Hive table
Question

I am looking to use Spark Structured Streaming to read data from Kafka, process it, and write it to a Hive table.
val spark = SparkSession
.builder
.appName("Kafka Test")
.config("spark.sql.streaming.metricsEnabled", true)
.config("spark.streaming.backpressure.enabled", "true")
.enableHiveSupport()
.getOrCreate()
val events = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xxxxxxx")
.option("startingOffsets", "latest")
.option("subscribe", "yyyyyy")
.load
val data = events.select(.....some columns...)
data.writeStream
.format("parquet")
.option("compression", "snappy")
.outputMode("append")
.partitionBy("ds")
.option("checkpointLocation", "maprfs:/zzzzzz") // required for streaming file sinks; placeholder path
.option("path", "maprfs:/xxxxxxx")
.start()
.awaitTermination()
This does create parquet files, but how do I change it so that it writes into a table format that can be queried (select * from) from Hive or spark-sql, mimicking something like:
data.write.format("parquet").option("compression", "snappy").mode("append").partitionBy("ds").saveAsTable("xxxxxx")
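One way to get exactly this batch-style saveAsTable behavior from a stream is foreachBatch, available in Spark 2.4+. The sketch below is untested here and assumes Spark 2.4 or later; the checkpoint path and table name are placeholders:

```scala
// A minimal sketch, assuming Spark 2.4+ (foreachBatch) and a session
// built with enableHiveSupport(); names/paths below are placeholders.
import org.apache.spark.sql.DataFrame

data.writeStream
  .outputMode("append")
  .option("checkpointLocation", "maprfs:/checkpoints/kafka-test") // placeholder
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Each micro-batch is written with the ordinary batch writer,
    // so it lands in the Hive metastore just like data.write...saveAsTable
    batchDF.write
      .format("parquet")
      .option("compression", "snappy")
      .mode("append")
      .partitionBy("ds")
      .saveAsTable("my_hive_table") // placeholder table name
  }
  .start()
  .awaitTermination()
```

On older Spark versions, an alternative is to keep the parquet file sink as-is and define an external Hive table over the output path, refreshing partitions (e.g. with MSCK REPAIR TABLE) as new ones arrive.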
Answer

I would recommend looking at Kafka Connect for writing the data to HDFS. It is open source and available standalone or as part of Confluent Platform.
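As an illustration of the Kafka Connect route, a sink connector configuration might look roughly like the following. This is a hypothetical sketch based on the Confluent HDFS sink connector; all hostnames, the topic name, and sizing values are placeholders:

```properties
# Hypothetical HDFS sink connector config (placeholder names/URIs);
# hive.integration=true asks the connector to create and update the
# corresponding Hive table as it writes files.
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=yyyyyy
hdfs.url=hdfs://namenode:8020
flush.size=1000
hive.integration=true
hive.metastore.uris=thrift://metastore:9083
schema.compatibility=BACKWARD
```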
For filtering and transforming the data you could use Kafka Streams or KSQL. KSQL runs on top of Kafka Streams and gives you a very simple way to join data, filter it, and build aggregations.
Here's an example of aggregating a stream of data in KSQL:
SELECT PAGE_ID, COUNT(*) FROM PAGE_CLICKS WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY PAGE_ID;
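To persist such an aggregate rather than just query it interactively, KSQL can write the result back to a Kafka topic with CREATE TABLE ... AS SELECT. A sketch in the same vein (the table and column alias names are placeholders):

```sql
-- Materialize the hourly click counts as a continuously updated table
-- backed by a Kafka topic; names here are illustrative.
CREATE TABLE PAGE_CLICKS_HOURLY AS
  SELECT PAGE_ID, COUNT(*) AS CLICK_COUNT
  FROM PAGE_CLICKS
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY PAGE_ID;
```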
See KSQL in action in this blog post. You might also be interested in this talk about building streaming data pipelines with these components.