Apache Spark 2.0 (PySpark) - DataFrame Error Multiple sources found for csv


Problem description

I am trying to create a DataFrame using the following code in Spark 2.0. While executing the code in Jupyter/console, I am facing the error below. Can someone help me get rid of this error?

Error:

   Py4JJavaError: An error occurred while calling o34.csv. :
   java.lang.RuntimeException: Multiple sources found for csv
   (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
   com.databricks.spark.csv.DefaultSource15), please specify the fully
   qualified class name. at scala.sys.package$.error(package.scala:27)

Code:

   from pyspark.sql import SparkSession

   if __name__ == "__main__":
       session = SparkSession.builder \
           .master('local') \
           .appName("RealEstateSurvey") \
           .getOrCreate()
       df = session \
           .read \
           .option("inferSchema", value=True) \
           .option('header', 'true') \
           .csv("/home/senthiljdpm/RealEstate.csv")

       print("=== Print out schema ===")
       df.printSchema()
       session.stop()

Answer

The error occurs because you have both CSV libraries on your classpath: Spark 2.0's built-in source (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat) and the external spark-csv package (com.databricks.spark.csv.DefaultSource). Both register the short name csv, so Spark cannot decide which one to use.
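This conflict typically appears when the external spark-csv package, which was only needed on Spark 1.x, is still passed to a Spark 2.0 session at launch. A minimal illustration (the package coordinate is the published spark-csv artifact; your own launch command and script name may differ):

```shell
# Spark 2.0 ships its own csv source, so also pulling in the old
# spark-csv package puts a second "csv" source on the classpath:
spark-submit --packages com.databricks:spark-csv_2.11:1.5.0 survey.py

# On Spark 2.0, simply launching without the package avoids the
# ambiguity entirely:
spark-submit survey.py
```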

All you need is to tell Spark to use com.databricks.spark.csv.DefaultSource by defining the format option:

  df = session \
       .read \
       .format("com.databricks.spark.csv") \
       .option("inferSchema", value = True) \
       .option('header','true') \
       .csv("/home/senthiljdpm/RealEstate.csv")

Another alternative is to use load instead:

  df = session \
       .read \
       .format("com.databricks.spark.csv") \
       .option("inferSchema", value = True) \
       .option('header','true') \
       .load("/home/senthiljdpm/RealEstate.csv")
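As the error message itself suggests, the ambiguity can also be resolved the other way: keep Spark 2.0's built-in reader by passing its fully qualified class name as the format. A sketch under the same file path as above:

```python
from pyspark.sql import SparkSession

session = SparkSession.builder \
    .master('local') \
    .appName("RealEstateSurvey") \
    .getOrCreate()

# The fully qualified class name selects Spark's built-in csv source
# directly, so the ambiguous short name "csv" is never looked up.
df = session.read \
    .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("/home/senthiljdpm/RealEstate.csv")
```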
