How to read all files in S3 folder/bucket using sparklyr in R?


Problem description

I tried the code below, and variations of it, to read all the files in an S3 folder, but nothing seems to work. Sensitive information and code have been removed from the script below. There are 6 files, each 6.5 GB in size.

# Spark connection
sc <- spark_connect(master = "local", config = config)

rd_1 <- spark_read_csv(sc, name = "Retail_1", path = "s3a://mybucket/xyzabc/Retail_Industry/*/*", header = FALSE, delimiter = "|")

# This is the S3 bucket/folder for the files (one of the file names is Industry_Raw_Data_000)
s3://mybucket/xyzabc/Retail_Industry/Industry_Raw_Data_000

This is the error I get:

Error: org.apache.spark.sql.AnalysisException: Path does not exist: s3a://mybucket/xyzabc/Retail_Industry/*/*;
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:710)

Recommended answer

After a few weeks of searching for this issue, I finally solved it. Here is the solution:

library(sparklyr)
library(dplyr)  # also makes the %>% pipe available

# AWS credentials (placeholder values; the real ones were removed)
Sys.setenv(AWS_ACCESS_KEY_ID = "abc")
Sys.setenv(AWS_SECRET_ACCESS_KEY = "xyz")

config <- spark_config()

# Extra packages for CSV parsing and the s3a:// filesystem
config$sparklyr.defaultPackages <- c(
  "com.databricks:spark-csv_2.10:1.5.0",
  "com.amazonaws:aws-java-sdk-pom:1.10.34",
  "org.apache.hadoop:hadoop-aws:2.7.3"
)

# Spark connection
sc <- spark_connect(master = "local", config = config)

# Hadoop configuration
ctx <- spark_context(sc)
jsc <- invoke_static(
  sc,
  "org.apache.spark.api.java.JavaSparkContext",
  "fromSparkContext",
  ctx
)

hconf <- jsc %>% invoke("hadoopConfiguration")
hconf %>% invoke("set", "com.amazonaws.services.s3a.enableV4", "true")
hconf %>% invoke("set", "fs.s3a.fast.upload", "true")

# Pass the folder itself (no wildcards) so every file in it is read
folder_files <- "s3a://mybucket/abc/xyz"

rd_11 <- spark_read_csv(
  sc,
  name = "Retail",
  path = folder_files,
  infer_schema = TRUE,
  header = FALSE,
  delimiter = "|"
)

spark_disconnect(sc)
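
One likely contributor to the original "Path does not exist" error is the `*/*` glob itself. The file listed in the question sits directly under `Retail_Industry/`, while `*/*` asks for one more directory level. The sketch below illustrates this without Spark or S3. `glob_matches` is a hypothetical helper written for this example (it is not part of sparklyr or Hadoop), and it only approximates Hadoop-style globbing by assuming `*` never crosses a `/`. The `s3a://` form of the file path is also an assumption, since the question lists it with `s3://`.

```r
# Hypothetical helper (not part of sparklyr or Hadoop): approximates
# Hadoop-style globbing, where `*` matches within a single path segment
# and never crosses a "/".
glob_matches <- function(pattern, path) {
  pattern_parts <- strsplit(pattern, "/", fixed = TRUE)[[1]]
  path_parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  # A glob with N segments can only match a path with N segments.
  if (length(pattern_parts) != length(path_parts)) {
    return(FALSE)
  }
  # Compare segment by segment, turning each glob segment into a regex.
  all(mapply(
    function(glob, segment) grepl(utils::glob2rx(glob), segment),
    pattern_parts, path_parts
  ))
}

# File name taken from the question (scheme changed to s3a:// for comparison).
file_path <- "s3a://mybucket/xyzabc/Retail_Industry/Industry_Raw_Data_000"

# The glob from the question expects one more directory level than exists.
nested_match <- glob_matches("s3a://mybucket/xyzabc/Retail_Industry/*/*", file_path)

# A single wildcard matches files directly under the folder.
flat_match <- glob_matches("s3a://mybucket/xyzabc/Retail_Industry/*", file_path)

print(c(nested = nested_match, flat = flat_match))
```

If this reading is right, passing the folder path on its own, as the answer does, or using a single `*` avoids the mismatch. The extra packages and credentials in the answer may still be needed for `s3a://` access to work at all.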
