Spark structured stream to Kudu context
Problem description
I want to read a Kafka topic and then write it to a Kudu table via Spark Streaming.
// sessions and contexts
val conf = new SparkConf().setMaster("local[2]").setAppName("TestMain")
val sparkSession = SparkSession.builder().config(conf).getOrCreate()
val sparkContext = sparkSession.sparkContext
val kuduContext = new KuduContext("...", sparkContext);
// structure
val schema: StructType = StructType(
  StructField("userNo", IntegerType, true) ::
  StructField("bandNo", IntegerType, false) ::
  StructField("ipv4", StringType, false) :: Nil)
// kudu - prepare table (guard the delete so a missing table doesn't fail)
if (kuduContext.tableExists("test_table")) kuduContext.deleteTable("test_table")
kuduContext.createTable("test_table", schema, Seq("userNo"), new CreateTableOptions()
  .setNumReplicas(1)
  .addHashPartitions(List("userNo").asJava, 3))
// get stream from kafka
val parsed = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("startingOffsets", "latest")
  .option("subscribe", "feed_api_band_get_popular_post_list")
  .load()
  .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))
// write it to kudu
kuduContext.insertRows(parsed.toDF(), "test_table")
Now it complains:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
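The error occurs because `kuduContext.insertRows` is a batch operation, while `parsed` is a streaming DataFrame: a streaming source can only be consumed through `writeStream.start()`. A minimal sketch of a Spark 2.2-compatible workaround is a custom `ForeachWriter` sink; the Kudu session handling inside it is left abstract here, since the exact client wiring is an assumption, not part of the question:

```scala
// Hedged sketch, assuming Spark 2.2: drive the streaming DataFrame through
// writeStream with a custom ForeachWriter instead of a batch insert.
import org.apache.spark.sql.{ForeachWriter, Row}

val query = parsed
  .select("parsed_value.*")      // flatten the parsed struct into columns
  .writeStream
  .foreach(new ForeachWriter[Row] {
    // a Kudu client/session would be opened per partition here (omitted)
    def open(partitionId: Long, version: Long): Boolean = true
    def process(row: Row): Unit = {
      // insert `row` into test_table via the Kudu session
    }
    def close(errorOrNull: Throwable): Unit = ()
  })
  .start()

query.awaitTermination()
```

Note that `ForeachWriter` works row by row, so each partition has to manage its own Kudu connection in `open`/`close`.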
My second approach

It seems I should change my code to use the traditional KafkaUtils.createDirectStream:
KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
).foreachRDD(rdd => {
  rdd.foreach(record => {
    // write to kudu.............
    println(record.value())
  })
})
ssc.start();
ssc.awaitTermination();
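The second snippet only prints the records; to actually land them in Kudu, one option is to convert each micro-batch RDD to a DataFrame on the driver and hand it to `kuduContext.insertRows`. A sketch, assuming the `schema`, `kuduContext`, and `sparkSession` defined earlier and that the Kafka value is JSON matching the schema:

```scala
// Hedged sketch: write each micro-batch to Kudu in the DStream approach.
import org.apache.spark.sql.functions.{col, from_json}

KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams)
).foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    import sparkSession.implicits._
    val df = rdd.map(_.value()).toDF("value")
      .select(from_json(col("value"), schema).alias("v"))
      .select("v.*")                       // one column per schema field
    kuduContext.insertRows(df, "test_table") // batch insert per micro-batch
  }
}
```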
So, which one is the right approach? Or is there any way to make the first approach work?
Spark version is 2.2.0.
Recommended answer
Both approaches seem right. The first uses the Spark Structured Streaming way of doing things, where data is appended on a tabular basis. The second does it via the traditional DStream way.
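For completeness: newer Spark versions (2.4+, not the asker's 2.2) offer `foreachBatch`, which bridges the two approaches by letting a structured stream reuse batch writers such as `KuduContext`. A hedged sketch, assuming the `parsed` stream and `kuduContext` from the question:

```scala
// Hedged sketch (requires Spark 2.4+): foreachBatch exposes each micro-batch
// as an ordinary DataFrame, so the batch Kudu API applies directly.
val query = parsed
  .select("parsed_value.*")
  .writeStream
  .foreachBatch { (batchDf: org.apache.spark.sql.DataFrame, batchId: Long) =>
    kuduContext.insertRows(batchDf, "test_table")
  }
  .start()

query.awaitTermination()
```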