SparklyR: Convert directly to parquet
Question
Hi, I am new to cluster computing and currently I am only playing around on a standalone cluster (sc <- spark_connect(master = "local", version = '2.0.2')). I have a massive CSV file (15GB) which I would like to convert to a Parquet file (the third chunk of code explains why). This 15GB file is already a sample of a 60GB file, and I will need to use/query the full 60GB file once I stop playing around. Currently what I did was:
> system.time({FILE<-spark_read_csv(sc,"FILE",file.path("DATA/FILE.csv"),memory = FALSE)})
user system elapsed
0.16 0.04 1017.11
> system.time({spark_write_parquet(FILE, file.path("DATA/FILE.parquet"),mode='overwrite')})
user system elapsed
0.92 1.48 1267.72
> system.time({FILE<-spark_read_parquet(sc,"FILE", file.path("DATA/FILE.parquet"),memory = FALSE)})
user system elapsed
0.00 0.00 0.26
As you can see, this takes quite a long time. I was wondering what happens in the first line of code (spark_read_csv) with memory = FALSE? Where does it read/save the data to? And can I access that location when I disconnect and reconnect the session?
Also, is there a way to combine step 1 & 2 in a more efficient way?
I am not shy about trying lower-level functions that aren't exposed in the API yet, provided the approach is simple and can be automated to a large degree.
Answer
No data is saved when spark_read_csv is invoked with memory = FALSE. The delay you see is related not to data loading as such, but to the schema-inference process, which requires a separate scan of the data.
As convenient as schema inference is, it is much better performance-wise to provide the schema explicitly, as a named vector mapping column names to type strings. For example, if you were to load the iris dataset in local mode:
path <- tempfile()
readr::write_csv(iris, path)
you would use:
spark_read_csv(
sc, "iris", path, infer_schema=FALSE, memory = FALSE,
columns = c(
Sepal_Length = "double", Sepal_Width = "double",
Petal_Length = "double", Petal_Width = "double",
Species = "string"))
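As for combining steps 1 and 2: with an explicit schema there is no inference scan, so you can read the CSV lazily and write it straight to Parquet in a single pass, with no intermediate caching. A minimal sketch, assuming a working `sc` connection; the column names and types below are placeholders you would replace with the actual schema of your file:

```r
library(sparklyr)

# Read lazily with an explicit schema (no inference scan, no caching) ...
FILE <- spark_read_csv(
  sc, "FILE", file.path("DATA/FILE.csv"),
  infer_schema = FALSE, memory = FALSE,
  columns = c(id = "integer", value = "double", label = "string")  # placeholder schema
)

# ... then write directly to Parquet; the data is streamed through Spark
# in one job rather than being materialized first.
spark_write_parquet(FILE, file.path("DATA/FILE.parquet"), mode = "overwrite")
```

The resulting Parquet directory is persisted on disk, so after disconnecting you can reconnect and load it again with spark_read_parquet, as in your third chunk.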