Read CSV file from Azure Blob storage in RStudio Server with spark_read_csv()


Problem description


I have provisioned an Azure HDInsight cluster type ML Services (R Server), operating system Linux, version ML Services 9.3 on Spark 2.2 with Java 8 HDI 3.6.

Within RStudio Server I am trying to read in a CSV file from my blob storage.

Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
Sys.setenv(YARN_CONF_DIR="/etc/hadoop/conf")
Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf")
Sys.setenv(SPARK_CONF_DIR="/etc/spark/conf")

options(rsparkling.sparklingwater.version = "2.2.28")

library(sparklyr)
library(dplyr)
library(h2o)
library(rsparkling)


sc <- spark_connect(master = "yarn-client",
                    version = "2.2.0")

origins <- file.path("wasb://MYDefaultContainer@MyStorageAccount.blob.core.windows.net",
                     "user/RevoShare")

df2 <- spark_read_csv(sc,
                      path = origins,
                      name = 'Nov-MD-Dan',
                      memory = FALSE)

When I run this I get the following error:

Error: java.lang.IllegalArgumentException: invalid method csv for object 235
    at sparklyr.Invoke$.invoke(invoke.scala:122)
    at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
    at sparklyr.StreamHandler$.read(stream.scala:62)
    at sparklyr.BackendHandler.channelRead0(handler.scala:52)
    at sparklyr.BackendHandler.channelRead0(handler.scala:14)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:748)

Any help would be awesome!

Solution

The path origins should point to a CSV file or a directory of CSVs. Are you sure that origins points to a directory of files or a file? There's typically at least one more directory under /user/RevoShare/ for each HDFS user, i.e., /user/RevoShare/sshuser/.

Here's an example that may help:

sample_file <- file.path("/example/data/", "yellowthings.txt")

library(sparklyr)
library(dplyr)
cc <- rxSparkConnect(interop = "sparklyr")
sc <- rxGetSparklyrConnection(cc)

fruits <- spark_read_csv(sc, path = sample_file, name = "fruits", header = FALSE)
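Applied to the wasb:// location from the question, the same idea can be sketched as below. This is only an illustration under assumptions: `build_wasb_path()` and `read_origin()` are hypothetical helpers, and the `sshuser` directory and `mydata.csv` file name are placeholders, not paths known to exist in the asker's storage account. The table name uses underscores rather than hyphens as a cautious choice, since it ends up as a Spark table identifier.

```r
# Hypothetical helper: build a wasb:// path to one CSV file
# (or to a directory that contains only CSV files).
build_wasb_path <- function(container, account, ...) {
  root <- sprintf("wasb://%s@%s.blob.core.windows.net", container, account)
  file.path(root, ...)
}

# Placeholder container, account, user directory and file name.
origins <- build_wasb_path("MYDefaultContainer", "MyStorageAccount",
                           "user", "RevoShare", "sshuser", "mydata.csv")
print(origins)

# Hypothetical wrapper around sparklyr::spark_read_csv(); `sc` must be an
# open sparklyr connection, so this is defined here but not called.
read_origin <- function(sc, path, name = "nov_md_dan") {
  sparklyr::spark_read_csv(sc, path = path, name = name, memory = FALSE)
}
```

With a live connection this would be called as `df2 <- read_origin(sc, origins)`.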

You can use rxHadoopListFiles("/example/data/") or run hdfs dfs -ls /example/data to inspect your directories on HDFS / Blob.
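If you prefer to stay inside the R session, the command-line listing can be wrapped in a small function. This is a rough sketch, not part of the original answer: `hdfs_ls_args()` and `list_hdfs()` are hypothetical names, and it assumes the `hdfs` command is on the PATH of the node where R is running.

```r
# Hypothetical helper: arguments for `hdfs dfs -ls <path>`.
hdfs_ls_args <- function(path) {
  c("dfs", "-ls", path)
}

# Run the listing only when the hdfs CLI is available; otherwise
# return an empty character vector instead of failing.
list_hdfs <- function(path) {
  if (nzchar(Sys.which("hdfs"))) {
    system2("hdfs", hdfs_ls_args(path), stdout = TRUE)
  } else {
    message("hdfs CLI not found on PATH; run this on a cluster node")
    character(0)
  }
}

# Check what actually sits under /user/RevoShare before calling spark_read_csv().
listing <- list_hdfs("/user/RevoShare")
print(listing)
```

Each line of the result corresponds to one entry in the directory, which makes it easy to confirm whether the path you pass to spark_read_csv() is a file or a directory.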

HTH!
