How to get csv on s3 with pyspark (No FileSystem for scheme: s3n)


Problem description

There are many similar questions on SO, but I simply cannot get this to work. I'm obviously missing something.

Trying to load a simple test csv file from my s3.

Doing it locally, like below, works.

from pyspark.sql import SparkSession
from pyspark import SparkContext as sc

logFile = "sparkexamplefile.csv"
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

But if I add this below:

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "foo")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "bar")
lines = sc.textFile("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
lines.count()

I get:

No FileSystem for scheme: s3n

I've also tried changing sc to spark.sparkContext without any difference.

I've also tried swapping // and /// in the URL.

Even better, I'd rather do this and go straight to data frame:

dataFrame = spark.read.csv("s3n:///mybucket-sparkexample/sparkexamplefile.csv")

Also I am slightly AWS ignorant, so I have tried s3, s3n, and s3a to no avail.

I've been around the internet and back but can't seem to resolve the scheme error. Thanks!

Recommended answer

I think your Spark environment doesn't have the AWS jars. You need to add them in order to use s3 or s3n.

You have to copy the required jar files from a Hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag with spark-submit didn't work.

Here my Spark version is Spark 2.3.0 and Hadoop 2.7.6, so you have to copy the following jars from (hadoop dir)/share/hadoop/tools/lib/ to $SPARK_HOME/jars:

aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
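
Once those two jars are in $SPARK_HOME/jars, a minimal end-to-end sketch (the bucket name and the "foo"/"bar" credentials are the placeholders from the question) could look like this. Note that the Hadoop settings belong on the live SparkContext instance obtained from the session, not on the SparkContext class imported as sc in the question:

from pyspark.sql import SparkSession

# Assumes aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.6.jar have been copied
# from (hadoop dir)/share/hadoop/tools/lib/ into $SPARK_HOME/jars
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

# Set the s3n credentials on the running context's Hadoop configuration
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "foo")      # placeholder key
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "bar")  # placeholder secret

# Read the file straight into a DataFrame, as the question intended
dataFrame = spark.read.csv("s3n://mybucket-sparkexample/sparkexamplefile.csv")
print(dataFrame.count())

Accessing _jsc is a private-API workaround, but it is the common way to set Hadoop configuration from PySpark at runtime.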
