Spark 2.0: Relative path in absolute URI (spark-warehouse)


Question


I'm trying to migrate from Spark 1.6.1 to Spark 2.0.0 and I am getting a weird error when trying to read a csv file into SparkSQL. Previously, when I would read a file from local disk in pyspark I would do:

Spark 1.6

# sqlContext: a pre-existing SQLContext (created automatically in the pyspark shell)
df = sqlContext.read \
        .format('com.databricks.spark.csv') \
        .option('header', 'true') \
        .load('file:///C:/path/to/my/file.csv', schema=mySchema)


In the latest release I think it should look like this:

Spark 2.0

from pyspark.sql import SparkSession

spark = SparkSession.builder \
           .master('local[*]') \
           .appName('My App') \
           .getOrCreate()

df = spark.read \
        .format('csv') \
        .option('header', 'true') \
        .load('file:///C:/path/to/my/file.csv', schema=mySchema)


But I am getting this error no matter how many different ways I try to adjust the path:

IllegalArgumentException: 'java.net.URISyntaxException: Relative path in 
absolute URI: file:/C:/path//to/my/file/spark-warehouse'


Not sure if this is just an issue with Windows or there is something I am missing. I was excited that the spark-csv package is now a part of Spark right out of the box, but I can't seem to get it to read any of my local files anymore. Any ideas?
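My reading of that exception (not stated in the original post): the path it complains about is Spark 2.0's default `spark-warehouse` directory, derived from the working directory, and on Windows the drive letter tends to yield a URI of the form `file:/C:/...` that `java.net.URI` rejects. For reference, a well-formed `file://` URI for a Windows path can be built with Python's `pathlib` (the path below is the illustrative one from the question, not a real file):

```python
from pathlib import PureWindowsPath

# PureWindowsPath lets us construct the URI even when not running on Windows.
uri = PureWindowsPath('C:/path/to/my/file.csv').as_uri()
print(uri)  # file:///C:/path/to/my/file.csv
```

Note the three slashes after `file:`; the single-slash form seen in the error message is what Java's URI parser chokes on.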

Answer


I did some digging around in the latest Spark documentation and noticed a new configuration setting I hadn't seen before:

spark.sql.warehouse.dir


So I went ahead and added this setting when I set up my SparkSession:

spark = SparkSession.builder \
           .master('local[*]') \
           .appName('My App') \
           .config('spark.sql.warehouse.dir', 'file:///C:/path/to/my/') \
           .getOrCreate()
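If hard-coding the Windows URI feels brittle, the config value can be built portably with `pathlib` (a sketch of my own, not from the original answer; `spark-warehouse` is Spark's default directory name):

```python
import os
from pathlib import Path

# Build a well-formed file:// URI for a 'spark-warehouse' directory
# under the current working directory; works on any OS.
warehouse_uri = Path(os.getcwd(), 'spark-warehouse').as_uri()

# Then pass it when building the session:
# .config('spark.sql.warehouse.dir', warehouse_uri)
```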


That seems to set the working directory, and then I can just feed my filename directly into the csv reader:

df = spark.read \
        .format('csv') \
        .option('header', 'true') \
        .load('file.csv', schema=mySchema) 


Once I set the spark warehouse, Spark was able to locate all of my files and my app finishes successfully now. The amazing thing is that it runs about 20 times faster than it did in Spark 1.6. So they really have done some very impressive work optimizing their SQL engine. Spark it up!
