Apache Spark 2.0 (PySpark) - DataFrame Error: Multiple sources found for csv


Problem Description

I am trying to create a DataFrame using the following code in Spark 2.0. While executing the code in Jupyter/Console, I get the error below. Can someone help me get rid of this error?

Error:

Py4JJavaError: An error occurred while calling o34.csv. : java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. at scala.sys.package$.error(package.scala:27)

Code:

   from pyspark.sql import SparkSession

   if __name__ == "__main__":
      session = SparkSession.builder.master('local') \
                     .appName("RealEstateSurvey").getOrCreate()
      df = session \
           .read \
           .option("inferSchema", value=True) \
           .option('header', 'true') \
           .csv("/home/senthiljdpm/RealEstate.csv")

      print("=== Print out schema ===")
      session.stop()

Answer

The error occurs because you have both CSV libraries on your classpath: Spark 2.0's built-in org.apache.spark.sql.execution.datasources.csv.CSVFileFormat and the external com.databricks.spark.csv.DefaultSource. Spark cannot decide which one to use.
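If you are on Spark 2.0 or later and do not specifically need the external package, the cleanest fix is to stop shipping it: drop the spark-csv coordinate from the submit command. The commands below are a hypothetical sketch (the app name and package version are illustrative, not taken from the question):

```shell
# Hypothetical submit commands. Spark 2.0 ships a built-in CSV reader,
# so the external spark-csv package can simply be left out.

# Before (pulls in com.databricks.spark.csv and causes the conflict):
#   spark-submit --packages com.databricks:spark-csv_2.11:1.5.0 app.py

# After (built-in CSV source only):
spark-submit app.py
```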

All you need to do is tell Spark to use com.databricks.spark.csv.DefaultSource by defining the format option:

  df = session \
       .read \
       .format("com.databricks.spark.csv") \
       .option("inferSchema", value = True) \
       .option('header','true') \
       .csv("/home/senthiljdpm/RealEstate.csv")

Another alternative is to use load instead:

  df = session \
       .read \
       .format("com.databricks.spark.csv") \
       .option("inferSchema", value = True) \
       .option('header','true') \
       .load("/home/senthiljdpm/RealEstate.csv")

