Spark read file from S3 using sc.textFile("s3n://...")


Problem Description

Trying to read a file located in S3 using spark-shell:

scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
myRdd: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12

scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    ... etc ...

The IOException: No FileSystem for scheme: s3n error occurs with:

  • Spark 1.3.1 or 1.4.0 on the dev machine (no Hadoop libs)
  • Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.6.0), which integrates Spark 1.2.1 out of the box
  • Using the s3:// or s3n:// scheme

What is the cause of this error? A missing dependency, missing configuration, or misuse of sc.textFile()?

Or maybe this is due to a bug that affects the Spark build specific to Hadoop 2.6.0, as this post (http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-3-1-Hadoop-2-6-package-has-broken-S3-access-td12107.html) seems to suggest. I am going to try the Spark build for Hadoop 2.4 to see if this solves the issue.

Recommended Answer

Confirmed that this is related to the Spark build against Hadoop 2.6.0. I just installed Spark 1.4.0 "Pre-built for Hadoop 2.4 and later" (instead of the Hadoop 2.6 build), and the code now works OK.
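
A quick way to check which Hadoop version a given Spark distribution was built against is to query Hadoop's own VersionInfo class from the spark-shell (a minimal check, not part of the original answer; VersionInfo lives in hadoop-common, which the prebuilt Spark packages bundle):

scala> org.apache.hadoop.util.VersionInfo.getVersion   // should print the bundled Hadoop version, e.g. "2.4.0" for this build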

sc.textFile("s3n://bucketname/Filename") now raises another error:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

The code below uses the S3 URL format (with the credentials inline) to show that Spark can read an S3 file, run on the dev machine (no Hadoop libs).

scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey@zpub01/SafeAndSound_Lyrics.txt")
lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21

scala> lyrics.count
res1: Long = 9

Even better: the code above, with AWS credentials inline in the s3n URI, will break if the AWS Secret Key contains a forward slash ("/"). Configuring the AWS credentials on the SparkContext's Hadoop configuration fixes this, and the code works whether the S3 file is public or private.

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
myRDD.count
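
As a variant of the same idea, the keys can be read from environment variables so they never appear in the code. This is a minimal sketch, assuming the keys are exported under the conventional AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY names (the variable names are an assumption here, not something Spark requires):

// Sketch: pull the credentials from environment variables instead of hard-coding them.
// AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are conventional names, not mandated by Spark.
val accessKey = sys.env("AWS_ACCESS_KEY_ID")
val secretKey = sys.env("AWS_SECRET_ACCESS_KEY")   // may contain "/"
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)
sc.textFile("s3n://myBucket/MyFilePattern").count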
