Apache Spark Structured Streaming (DataStreamWriter) write to Hive table


Question

I am looking to use Spark Structured Streaming to read data from Kafka, process it, and write it to a Hive table.

val spark = SparkSession
  .builder
  .appName("Kafka Test")
  .config("spark.sql.streaming.metricsEnabled", true)
  .config("spark.streaming.backpressure.enabled", "true")
  .enableHiveSupport()
  .getOrCreate()

val events = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxxxxxx")
  .option("startingOffsets", "latest")
  .option("subscribe", "yyyyyy")
  .load


val data = events.select(.....some columns...)

data.writeStream
  .format("parquet")
  .option("compression", "snappy")
  .outputMode("append")
  .partitionBy("ds")
  .option("path", "maprfs:/xxxxxxx")
  .start()
  .awaitTermination()

This does create parquet files, but how do I change it so that the output is written as a table that can be read from Hive or spark-sql with a plain (select * from), similar to:

data.write.format("parquet").option("compression", "snappy").mode("append").partitionBy("ds").saveAsTable("xxxxxx")
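One possible approach, not part of the original question or answer, is to treat each micro-batch as an ordinary DataFrame and call saveAsTable on it. This is only a sketch: it assumes Spark 2.4+ (where foreachBatch is available), a session built with enableHiveSupport() as above, and a hypothetical table name.

```scala
// Sketch only: assumes Spark 2.4+ (foreachBatch) and Hive support enabled.
// The table name "my_db.my_table" is a placeholder, not from the question.
import org.apache.spark.sql.DataFrame

data.writeStream
  .outputMode("append")
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Each micro-batch is a regular DataFrame, so the batch writer API applies.
    batch.write
      .format("parquet")
      .option("compression", "snappy")
      .mode("append")
      .partitionBy("ds")
      .saveAsTable("my_db.my_table")
  }
  .start()
  .awaitTermination()
```

An alternative, on older Spark versions, is to keep the streaming parquet sink as-is and define an external Hive table over its output path so that (select * from) works against the same files.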

Answer

I would recommend looking at Kafka Connect for writing the data to HDFS. It is open source and available standalone or as part of Confluent Platform.
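A minimal sink configuration for the kafka-connect-hdfs connector might look like the following. This is a sketch: the topic name is taken from the question's placeholder, and the HDFS URL, metastore URI, and flush size are illustrative values.

```properties
# Sketch of a Kafka Connect HDFS sink configuration (standalone mode).
# URLs and sizes are placeholders, not from the question.
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=yyyyyy
hdfs.url=hdfs://namenode:8020
flush.size=1000
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
# Optional Hive integration, so the data is registered in the metastore
# and queryable as a table:
hive.integration=true
hive.metastore.uris=thrift://metastore:9083
schema.compatibility=BACKWARD
```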

For filtering and transforming the data you could use Kafka Streams or KSQL. KSQL runs on top of Kafka Streams and gives you a very simple way to join data, filter it, and build aggregations.

Here's an example of aggregating a stream of data in KSQL:

SELECT PAGE_ID, COUNT(*) FROM PAGE_CLICKS WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY PAGE_ID;
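The same kind of windowed count can be expressed with the Kafka Streams Scala DSL. This is a sketch, not from the original answer: the topic name, key/value types, and the kafka-streams-scala artifact (with its implicit serdes) are all assumptions.

```scala
// Sketch using the Kafka Streams Scala DSL (kafka-streams-scala artifact).
// Topic name "page_clicks" and String key/value serdes are illustrative.
import java.time.Duration
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.kstream.TimeWindows

val builder = new StreamsBuilder()
val clicks = builder.stream[String, String]("page_clicks")

// Count clicks per page id over tumbling one-hour windows,
// mirroring the KSQL statement above.
val counts = clicks
  .groupByKey
  .windowedBy(TimeWindows.of(Duration.ofHours(1)))
  .count()
```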

See KSQL in action in this blog. You might also be interested in this talk about building streaming data pipelines with these components.
