SparklyR: Convert directly to parquet


Problem Description


Hi, I am new to cluster computing and currently I am only playing around on a standalone cluster (sc <- spark_connect(master = "local", version = '2.0.2')). I have a massive CSV file (15GB) which I would like to convert to a Parquet file (the third chunk of code explains why). This 15GB file is itself a sample of a 60GB file, and I will need to use/query the full 60GB file once I stop playing around. Currently what I did was:

> system.time({FILE<-spark_read_csv(sc,"FILE",file.path("DATA/FILE.csv"),memory = FALSE)})
   user  system elapsed 
   0.16    0.04 1017.11 
> system.time({spark_write_parquet(FILE, file.path("DATA/FILE.parquet"),mode='overwrite')})
   user  system elapsed 
   0.92    1.48 1267.72 
> system.time({FILE<-spark_read_parquet(sc,"FILE", file.path("DATA/FILE.parquet"),memory = FALSE)})
   user  system elapsed 
   0.00    0.00    0.26 


As you can see, this takes quite a long time. I was wondering what happens in the first line of code (spark_read_csv) with memory = FALSE. Where does it read/save the data to? And can I access that location when I disconnect and reconnect the session?


Also, is there a way to combine steps 1 and 2 in a more efficient way?


I am not shy about trying lower-level functions that aren't available in the API yet, as long as the approach is simple and can be automated to a large degree.

Recommended Answer


No data is saved when spark_read_csv is invoked with memory = FALSE. The delay you see is related not to data loading as such, but to the schema inference process, which requires a separate scan over the data.
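
In other words, with memory = FALSE the CSV is only registered as a Spark temporary view; nothing is cached or persisted anywhere, so it does not survive a reconnect. A minimal sketch of how to verify this, assuming the sc connection from the question:

library(sparklyr)

# With memory = FALSE the CSV is only registered as a temporary view;
# no data is cached in memory or written out.
dplyr::src_tbls(sc)   # the "FILE" view is listed among the session's tables

# Temporary views live only as long as the connection:
spark_disconnect(sc)
sc <- spark_connect(master = "local", version = '2.0.2')
dplyr::src_tbls(sc)   # the view is gone and must be re-created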


As convenient as schema inference is, performance-wise it is much better to provide the schema explicitly, as a named vector mapping column names to their simple type strings. For example, if you were to load the iris dataset in local mode:

path <- tempfile()
readr::write_csv(iris, path)

you would use:

spark_read_csv(
  sc, "iris", path, infer_schema = FALSE, memory = FALSE,
  columns = c(
    Sepal_Length = "double", Sepal_Width = "double",
    Petal_Length = "double", Petal_Width = "double",
    Species = "string"))
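
Putting the two pieces together also answers the "combine steps 1 and 2" question: with an explicit schema there is no inference scan, so the Parquet write becomes the only full pass over the CSV. A sketch for the original file follows; the column names and types in file_schema are hypothetical placeholders, so substitute FILE.csv's real layout:

library(sparklyr)
sc <- spark_connect(master = "local", version = '2.0.2')

# Hypothetical schema for FILE.csv -- replace with the file's real
# column names and simple type strings.
file_schema <- c(id = "integer", value = "double", label = "string")

# Step 1: register the CSV without caching it and without the
# schema inference scan ...
FILE <- spark_read_csv(
  sc, "FILE", file.path("DATA/FILE.csv"),
  infer_schema = FALSE, memory = FALSE,
  columns = file_schema)

# ... step 2: the Parquet write is now the only full pass over the data.
spark_write_parquet(FILE, file.path("DATA/FILE.parquet"), mode = 'overwrite')

Since Parquet stores its schema in the file footer, the later spark_read_parquet call needs no columns argument and stays as fast as in the timings above.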

