Pyspark Failed to find data source: kafka
Problem description
I am working on Kafka streaming and trying to integrate it with Apache Spark. However, while running I am getting into issues. I am getting the below error.
This is the command I am using.
df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe","taxirides").load()
Error:
Py4JJavaError: An error occurred while calling o77.load.: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
How do I resolve this?
NOTE: I am running this in Jupyter Notebook
import findspark
findspark.init('/home/karan/spark-2.1.0-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
Spark = SparkSession.builder.appName('KafkaStreaming').getOrCreate()
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
Everything is running fine till here (above code)
df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("subscribe","taxirides").load()
This is where things are going wrong (above code).
Blog I am following: https://www.adaltas.com/cn/2019/04/18/spark-streaming-data-pipelines-with-structured-streaming/
Recommended answer
It's not clear how you ran the code. Keep reading the blog, and you see
spark-submit \
...
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
sstreaming-spark-out.py
It seems you missed adding the --packages flag.
In Jupyter, you could add this
import os
# set up arguments: this must be done before pyspark is imported,
# and the string must end with pyspark-shell when launched from Python
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 pyspark-shell'
# initialize spark
import findspark
findspark.init()
import pyspark
Note: _2.11:2.4.0 needs to align with your Scala and Spark versions... Based on the question, yours should be Spark 2.1.0
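As a minimal sketch of that alignment, the package coordinate can be assembled from the two versions. The values below are assumptions taken from the question (the `spark-2.1.0-bin-hadoop2.7` path suggests a Scala 2.11 build of Spark 2.1.0):

```python
import os

# Assumed from the question: spark-2.1.0-bin-hadoop2.7 is a Scala 2.11 build.
scala_version = "2.11"
spark_version = "2.1.0"

# groupId:artifactId_scalaVersion:sparkVersion for the Structured Streaming
# Kafka connector.
package = "org.apache.spark:spark-sql-kafka-0-10_{}:{}".format(
    scala_version, spark_version)

# Must be set before pyspark is imported; "pyspark-shell" terminates the args.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {} pyspark-shell".format(package)
print(package)  # org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
```

If you later upgrade Spark, change both version strings together so the connector keeps matching the runtime.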