Spark 2.1 Structured Streaming - Using Kafka as source with Python (pyspark)
Problem description
With Apache Spark version 2.1, I would like to use Kafka (0.10.0.2.5) as the source for Structured Streaming with pyspark:
kafka_app.py:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestKafka").getOrCreate()

kafka = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:6667") \
    .option("subscribe", "mytopic") \
    .load()
I launched the app in the following way:
./bin/spark-submit kafka_app.py --master local[4] --jars spark-streaming-kafka-0-10-assembly_2.10-2.1.0.jar
after having downloaded the .jar from mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10-assembly_2.10/2.1.0. I got an error like this:
[...] java.lang.ClassNotFoundException:Failed to find data source: kakfa. [...]
Similarly, I cannot run the Spark example of integration with Kafka: https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html
So I wonder where I am wrong, or whether Kafka integration with Spark 2.1 using pyspark is actually supported at all, since this page, which mentions only Scala and Java as supported languages for version 0.10, makes me doubt it: https://spark.apache.org/docs/latest/streaming-kafka-integration.html (But if it is not supported yet, why was an example in Python published?)
Thanks in advance for your help!
Answer
You need to use the Spark SQL structured streaming jar spark-sql-kafka-0-10_2.11-2.1.0.jar instead of spark-streaming-kafka-0-10-assembly_2.10-2.1.0.jar (the latter is for the older DStream-based API, and its _2.10 suffix targets Scala 2.10 rather than the Scala 2.11 build of Spark 2.1).
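For illustration, a corrected launch command might look like the sketch below. The jar path is an assumption (adjust it to wherever the jar was downloaded); also note that spark-submit options must come before the application script, since anything after kafka_app.py is passed to the application itself as arguments:

```shell
# Sketch: launch with the Spark SQL Kafka source jar (local path is an assumption)
./bin/spark-submit \
  --master local[4] \
  --jars spark-sql-kafka-0-10_2.11-2.1.0.jar \
  kafka_app.py

# Alternatively, let Spark resolve the artifact from Maven Central:
./bin/spark-submit \
  --master local[4] \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 \
  kafka_app.py
```

With --packages, Spark downloads the artifact and its dependencies at submit time, which avoids managing the jar file by hand.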