R read ORC file from S3

Problem Description

We will be hosting an EMR cluster (with spot instances) on AWS, running on top of an S3 bucket. Data will be stored in this bucket in ORC format. However, we also want to read the same data with R, from some kind of sandbox environment.

I've got the aws.s3 package (cloudyr) running correctly: I can read CSV files without a problem, but it doesn't seem to let me turn the ORC files into anything readable.
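
For context, the CSV reads were along these lines. This is only a minimal sketch; the credentials, bucket name, and object key are placeholders, not values from the original setup:

library(aws.s3)

# Credentials for aws.s3 (placeholders)
Sys.setenv("AWS_ACCESS_KEY_ID" = "myKeyID",
           "AWS_SECRET_ACCESS_KEY" = "myKey",
           "AWS_DEFAULT_REGION" = "myRegion")

# Stream the object from S3 and parse it with read.csv
csv_df <- s3read_using(FUN = read.csv,
                       object = "filename.csv",
                       bucket = "bucketname")
head(csv_df)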

The two options I found online were SparkR and dataconnector (vertica).

Since installing dataconnector on a Windows machine was problematic, I installed SparkR and I am now able to read a local ORC file (R running locally on my machine, ORC file also local on my machine). However, if I try read.orc, it normalizes my path to a local path by default. Digging into the source code, I ran the following:

# The internals of SparkR::read.orc, skipping the normalizePath() step
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", my_path)

But I obtained the following error:

Error: Error in orc : java.io.IOException: No FileSystem for scheme: https

Could someone help me with this problem, or point me to an alternative way to load ORC files from S3?

Answer

Edited answer: you can now read directly from S3, without first downloading the file and reading it from the local file system.

At the request of mrjoseph: a possible solution via SparkR (which is not what I wanted to do in the first place).

# Set the System environment variable to where Spark is installed
Sys.setenv(SPARK_HOME="pathToSpark")
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "org.apache.hadoop:hadoop-aws:2.7.1" "sparkr-shell"')

# Set the library path to include path to SparkR
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))

# Set system environments to be able to load from S3
Sys.setenv("AWS_ACCESS_KEY_ID" = "myKeyID", "AWS_SECRET_ACCESS_KEY" = "myKey", "AWS_DEFAULT_REGION" = "myRegion")

# load required packages
library(aws.s3)
library(SparkR)

## Create a spark context and a sql context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Set path to file
path <- "s3n://bucketname/filename.orc"

# Set hadoop configuration
hConf = SparkR:::callJMethod(sc, "hadoopConfiguration")
SparkR:::callJMethod(hConf, "set", "fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsAccessKeyId", "myAccessKey")
SparkR:::callJMethod(hConf, "set", "fs.s3n.awsSecretAccessKey", "mySecretKey")

# Slight adaptation to read.orc function
sparkSession <- SparkR:::getSparkSession()
options <- SparkR:::varargsToStrEnv()
# Not required: path <- normalizePath(path)
read <- SparkR:::callJMethod(sparkSession, "read")
read <- SparkR:::callJMethod(read, "options", options)
sdf <- SparkR:::handledCallJMethod(read, "orc", path)
temp <- SparkR:::dataFrame(sdf)

# Read first lines
head(temp)
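
If the data then needs to live in plain R (for example for the sandbox work mentioned in the question), the Spark DataFrame can be pulled into a local data.frame with collect(), assuming it fits in memory:

# Bring the result into a local R data.frame (only sensible for small data)
local_df <- collect(temp)
str(local_df)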
