How can I read from S3 in pyspark running in local mode?


Question

I am using PyCharm 2018.1 with Python 3.4 and Spark 2.3 installed via pip in a virtualenv. There is no Hadoop installation on the local host, and no standalone Spark installation either (thus no SPARK_HOME, HADOOP_HOME, etc.).

When I try this:

from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")

I get:

py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.io.IOException: No FileSystem for scheme: s3

How can I read from S3 while running pyspark in local mode, without a complete Hadoop install locally?

FWIW - this works great when I execute it on an EMR node in non-local mode.

The following does not work (same error, although it does resolve and download the dependencies):

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:3.1.0" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")

The same (bad) result with:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars "/path/to/hadoop-aws-3.1.0.jar" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")

Answer

So Glennie's answer was close, but it won't work in your case. The key is to select the right version of the dependencies. If you look at the jars bundled in the virtual environment, everything points to one Hadoop version, 2.7.3, which is the version you also need to use:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

You should verify which version your installation uses by checking the path venv/Lib/site-packages/pyspark/jars inside your project's virtual environment.
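If you prefer to check programmatically rather than browsing the directory, here is a minimal sketch (assuming a pip-installed pyspark, so the jars live next to the package) that prints the bundled Hadoop jar names; the version suffix in names such as hadoop-common-2.7.3.jar is the one to match in the hadoop-aws coordinate:

import glob
import os
import pyspark

# Locate the jars directory shipped inside the pip-installed pyspark package
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")

# Print the bundled Hadoop jars; their version suffix tells you which
# hadoop-aws version to request via --packages
for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*.jar"))):
    print(os.path.basename(jar))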

After that you can use s3a by default, or s3 by defining the handler class for it:

# Only needed if you use s3://
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
s3File = sc.textFile("s3a://myrepo/test.csv")

print(s3File.count())
print(s3File.id())

Running this prints the record count and the RDD id of the file.
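For completeness, a consolidated sketch of the whole local-mode setup might look like the following; the bucket name, key path, and credential strings are placeholders, not values from the question:

import os

# Must be set before the SparkContext is created, so that spark-submit pulls
# the hadoop-aws artifact matching the bundled Hadoop 2.7.3 jars
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
)

from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setMaster("local") \
    .setAppName("pyspark-unittests") \
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)

# Route the legacy s3:// scheme to the s3a implementation and supply credentials
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", "awsKey")      # placeholder credential
hadoop_conf.set("fs.s3a.secret.key", "awsSecret")   # placeholder credential

inputFile = sc.textFile("s3a://somebucket/file.csv")  # hypothetical bucket/key
print(inputFile.count())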

