Apache Spark 2.0 (PySpark) - DataFrame Error: Multiple sources found for csv
Question
I am trying to create a DataFrame using the following code in Spark 2.0. While executing the code in Jupyter/Console, I get the error below. Can someone help me get rid of it?
Error:
Py4JJavaError: An error occurred while calling o34.csv. : java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. at scala.sys.package$.error(package.scala:27)
Code:
from pyspark.sql import SparkSession
if __name__ == "__main__":
    session = SparkSession.builder.master('local') \
        .appName("RealEstateSurvey").getOrCreate()
    df = session \
        .read \
        .option("inferSchema", value = True) \
        .option('header', 'true') \
        .csv("/home/senthiljdpm/RealEstate.csv")
    print("=== Print out schema ===")
    session.stop()
Answer
The error occurs because you have both CSV data sources (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat and com.databricks.spark.csv.DefaultSource) on your classpath, and Spark cannot decide which one to use.
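If it is not obvious where the second source comes from, one way to check (a small sketch, assuming the package was pulled in through Spark's standard configuration rather than a jar dropped onto the classpath by hand) is to inspect the packages and jars the session was launched with:

# Sketch: list the packages/jars this session was started with.
# com.databricks:spark-csv typically arrives via --packages or spark.jars.packages.
conf = session.sparkContext.getConf()
print(conf.get("spark.jars.packages", ""))  # e.g. com.databricks:spark-csv_2.10:1.5.0
print(conf.get("spark.jars", ""))           # any extra jars added explicitly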
All you need to do is tell Spark to use com.databricks.spark.csv.DefaultSource by setting the format option:
df = session \
    .read \
    .format("com.databricks.spark.csv") \
    .option("inferSchema", value = True) \
    .option('header', 'true') \
    .csv("/home/senthiljdpm/RealEstate.csv")
Another alternative is to use load instead:
df = session \
    .read \
    .format("com.databricks.spark.csv") \
    .option("inferSchema", value = True) \
    .option('header', 'true') \
    .load("/home/senthiljdpm/RealEstate.csv")