Spark read file from S3 using sc.textFile("s3n://...")


Problem Description

Trying to read a file located in S3 using spark-shell:

scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
lyrics: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12

scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    ... etc ...

The IOException: No FileSystem for scheme: s3n error occurs with:

  • Spark 1.3.1 or 1.4.0 on the dev machine (no Hadoop libs)
  • Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.6.0), which integrates Spark 1.2.1 out of the box
  • Using the s3:// or s3n:// scheme

What is the cause of this error? A missing dependency, a missing configuration, or misuse of sc.textFile()?

Or maybe this is due to a bug that affects the Spark build specific to Hadoop 2.6.0, as this post seems to suggest. I am going to try the Spark build for Hadoop 2.4.0 to see if that solves the issue.
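
For reference, the "missing dependency" hypothesis can be tested without switching builds. The sketch below assumes the hadoop-aws and jets3t JARs matching the Hadoop version have been put on the spark-shell classpath (e.g. via --jars); the JAR paths and bucket name are hypothetical, and this is not the fix the accepted answer ends up using.

// Minimal sketch: make the s3n:// scheme resolvable on a Spark build that ships
// without the S3 connector classes. Assumes the hadoop-aws and jets3t JARs were
// added to the classpath first, e.g.:
//   spark-shell --jars /path/to/hadoop-aws-2.6.0.jar,/path/to/jets3t-0.9.0.jar
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
val testRdd = sc.textFile("s3n://myBucket/myFile1.log")
println(testRdd.count())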

Recommended Answer

Confirmed that this is related to the Spark build against Hadoop 2.6.0. Just installed Spark 1.4.0 "Pre-built for Hadoop 2.4 and later" (instead of Hadoop 2.6), and the code now works OK.

sc.textFile("s3n://bucketname/Filename") 现在引发另一个错误:

sc.textFile("s3n://bucketname/Filename") now raises another error:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

The code below uses the S3N URL format with inline credentials to show that Spark can read an S3 file, running on the dev machine (no Hadoop libs).

scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey@zpub01/SafeAndSound_Lyrics.txt")
lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21

scala> lyrics.count
res1: Long = 9

Even better: the code above, with the AWS credentials inline in the S3N URI, breaks if the AWS Secret Key contains a forward slash "/". Configuring the AWS credentials on the SparkContext fixes that, and the code works whether the S3 file is public or private.

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
myRDD.count
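
If you prefer to keep the keys out of the driver code entirely, the same properties can, as far as I know, also be supplied through SparkConf using the spark.hadoop.* prefix, which Spark copies into the Hadoop configuration. A minimal sketch for a standalone application follows; the app name, key values, and bucket path are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone-app equivalent of the spark-shell snippet above.
// The spark.hadoop.* prefix makes Spark copy these entries into the Hadoop
// Configuration, so the credentials never appear in the S3N URL itself.
val conf = new SparkConf()
  .setAppName("S3ReadExample")
  .set("spark.hadoop.fs.s3n.awsAccessKeyId", "BLABLA")
  .set("spark.hadoop.fs.s3n.awsSecretAccessKey", "....") // may contain "/"
val sparkContext = new SparkContext(conf)
val myRDD = sparkContext.textFile("s3n://myBucket/MyFilePattern")
println(myRDD.count())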
