Reading csv data into SparkR after writing it out from a DataFrame


Problem description



I followed the example in this post to write out a DataFrame as a csv to an AWS S3 bucket. The result was not a single file but rather a folder containing many .csv files. I'm now having trouble reading this folder back in as a DataFrame in SparkR. Below is what I've tried, but neither attempt reproduces the DataFrame I wrote out.

write.df(df, 's3a://bucket/df', source="csv") #Creates a folder named df in S3 bucket

df_in1 <- read.df("s3a://bucket/df", source="csv")
df_in2 <- read.df("s3a://bucket/df/*.csv", source="csv")
# Neither df_in1 nor df_in2 results in a DataFrame that is the same as df

Solution

# Spark 1.4 is used in this example
# Download the nyc flights dataset as a CSV from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv

# Launch SparkR using 
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3

# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1

# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="true")

# Print the first few rows
head(flights)
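Applying the same reader to the original S3 question: in Spark 1.4 there is no built-in `"csv"` source, so the read must name the spark-csv package explicitly, just as the local example above does. The sketch below is an assumption based on that example, not part of the original answer; the bucket path is the asker's placeholder, and it presumes the s3a credentials and the spark-csv package are already configured.

```r
# Sketch: read the folder of part files produced by write.df back into one
# DataFrame, using the same spark-csv source that wrote it. Spark treats
# the folder as a single logical dataset, reading every part-*.csv inside.
# Assumes sqlContext exists and S3 access is configured (asker's path).
df_in <- read.df(sqlContext, "s3a://bucket/df",
                 source = "com.databricks.spark.csv",
                 header = "true")
```

Pointing `read.df` at the folder itself (not at a `*.csv` glob) is the idiomatic pattern; Spark writes and reads partitioned output at the directory level.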

Hope this example helps.
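If the goal is also a single output file rather than a folder of part files, one option (a sketch, not from the original answer) is to collapse the DataFrame to one partition before writing. This assumes `repartition` is available in the SparkR DataFrame API of the Spark version in use, and it funnels the whole write through a single task, so it only suits data that fits on one executor.

```r
# Sketch: collapse to one partition so write.df emits a single part file.
# df, sqlContext, and S3 access are assumed to exist; the path is the
# asker's placeholder. Not recommended for large datasets.
df_one <- repartition(df, 1L)
write.df(df_one, "s3a://bucket/df_single",
         source = "com.databricks.spark.csv",
         header = "true")
```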

