Spark Kafka Data Consuming Package


Problem Description

I tried to consume my Kafka topic with the code below, as mentioned in the documentation:

# Read the Kafka topic as a streaming DataFrame
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "first_topic") \
  .load()

# Decode the binary key/value columns to strings
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

Then I got this error:

AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".

So I tried:

./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 ...

to install the Kafka package and its dependencies, but I got this error:

21/06/21 13:45:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'File file:/home/soheil/spark-3.1.2-bin-hadoop3.2/... does not exist'.  Please specify one with --class.
    at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:968)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:486)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

What should I do to install this package?

Recommended Answer

The error you're getting here is not related to Kafka:

file:/home/soheil/spark-3.1.2-bin-hadoop3.2/... does not exist

This is referencing your HADOOP_HOME and/or HADOOP_CONF_DIR variables on the PATH that Spark depends on. Check that these are configured correctly and that you can run the Spark Structured Streaming word-count example that uses Kafka before running your own scripts.
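
A quick sanity check, assuming a typical bash setup, is to confirm that those variables actually resolve to real directories:

$ echo "$HADOOP_HOME"
$ echo "$HADOOP_CONF_DIR"

Once those resolve, try the stock example: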

$ bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
     structured_kafka_wordcount.py \
     host1:port1,host2:port2 subscribe topic1,topic2
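
structured_kafka_wordcount.py ships with the Spark distribution (under examples/src/main/python/sql/streaming/ in Spark 3.x); plugging in the broker and topic from the question, the invocation would look something like:

$ bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
     examples/src/main/python/sql/streaming/structured_kafka_wordcount.py \
     localhost:9092 subscribe first_topic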

The next part, Please specify one with --class., is saying that the CLI parser failed, probably because you mistyped the spark-submit options or there is a space somewhere in your file paths.
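
In other words, spark-submit expects the application file to come after the options; assuming a hypothetical script name consumer.py holding the code from the question, a working invocation would look like:

# /home/soheil/consumer.py is a placeholder for your own script's path
$ ./bin/spark-submit \
     --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
     /home/soheil/consumer.py

Any path that contains spaces would also need to be quoted.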

