Spark 2.0: Relative path in absolute URI (spark-warehouse)


Problem Description



I'm trying to migrate from Spark 1.6.1 to Spark 2.0.0 and I am getting a weird error when trying to read a csv file into SparkSQL. Previously, when I would read a file from local disk in pyspark I would do:

Spark 1.6

df = sqlContext.read \
        .format('com.databricks.spark.csv') \
        .option('header', 'true') \
        .load('file:///C:/path/to/my/file.csv', schema=mySchema)

In the latest release I think it should look like this:

Spark 2.0

spark = SparkSession.builder \
           .master('local[*]') \
           .appName('My App') \
           .getOrCreate()

df = spark.read \
        .format('csv') \
        .option('header', 'true') \
        .load('file:///C:/path/to/my/file.csv', schema=mySchema)

But I am getting this error no matter how many different ways I try to adjust the path:

IllegalArgumentException: 'java.net.URISyntaxException: Relative path in 
absolute URI: file:/C:/path//to/my/file/spark-warehouse'

Not sure if this is just an issue with Windows or there is something I am missing. I was excited that the spark-csv package is now a part of Spark right out of the box, but I can't seem to get it to read any of my local files anymore. Any ideas?
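(As an aside on the path format: a well-formed Windows file URI needs three slashes after the scheme, i.e. `file:///C:/...`, while the error message shows Spark constructing a single-slash `file:/C:/...` form internally. Python's standard library can build the well-formed version from a plain Windows path:)

```python
from pathlib import PureWindowsPath

# Build a well-formed file:// URI from a Windows-style path.
# Note the three slashes: scheme, empty authority, then the absolute path.
uri = PureWindowsPath(r"C:\path\to\my\file.csv").as_uri()
print(uri)  # file:///C:/path/to/my/file.csv
```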

Solution

I did some digging around in the latest Spark documentation and noticed a new configuration setting that I hadn't seen before:

spark.sql.warehouse.dir

So I went ahead and added this setting when I set up my SparkSession:

spark = SparkSession.builder \
           .master('local[*]') \
           .appName('My App') \
           .config('spark.sql.warehouse.dir', 'file:///C:/path/to/my/') \
           .getOrCreate()
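If you'd rather not hard-code the location in the application, the same property can also be supplied at launch time (the path below is a placeholder):

```shell
# Equivalent to the .config() call above, passed to spark-submit instead
spark-submit --conf "spark.sql.warehouse.dir=file:///C:/path/to/my/" my_app.py
```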

That seems to set the working directory, and then I can just feed my filename directly into the csv reader:

df = spark.read \
        .format('csv') \
        .option('header', 'true') \
        .load('file.csv', schema=mySchema) 

Once I set the spark warehouse, Spark was able to locate all of my files and my app finishes successfully now. The amazing thing is that it runs about 20 times faster than it did in Spark 1.6. So they really have done some very impressive work optimizing their SQL engine. Spark it up!

