SparkR bottleneck in createDataFrame?


Question


I'm new to Spark, SparkR, and HDFS-related technologies in general. I recently installed Spark 1.5.0 and ran some simple code with SparkR:

Sys.setenv(SPARK_HOME="/private/tmp/spark-1.5.0-bin-hadoop2.6")
.libPaths("/private/tmp/spark-1.5.0-bin-hadoop2.6/R/lib")
require('SparkR')
require('data.table')

sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
hiveContext <- sparkRHive.init(sc)

n = 1000
x = data.table(id = 1:n, val = rnorm(n))

Sys.time()
xs <- createDataFrame(sqlContext, x)
Sys.time()


The code executes immediately. However, when I change it to n = 1000000 it takes about 4 minutes (the time between the two Sys.time() calls). When I check these jobs in the Spark UI on port 4040, the job for n = 1000 has a duration of 0.2 s, and the job for n = 1000000 of 0.3 s. Am I doing something wrong?
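For reference, the same measurement can be taken in one step with system.time(), which reports the elapsed seconds of the call directly instead of requiring two timestamps:

system.time(xs <- createDataFrame(sqlContext, x))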

Answer


You're not doing anything particularly wrong. It is just an effect of a combination of different factors:



  1. createDataFrame, as it is currently implemented (Spark 1.5.1), is slow. It is a known issue described in SPARK-8277.
  2. The current implementation doesn't play well with data.table (see the sketch after this list).
  3. Base R is relatively slow. Smart people say it is a feature, not a bug, but it is still something to consider.
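Regarding the second point, a minimal sketch of a workaround (my illustration, not part of the original answer): coerce the data.table to a plain data.frame before the call. data.table::setDF() does the conversion in place, without copying:

library(data.table)

n <- 1000000
x <- data.table(id = 1:n, val = rnorm(n))

# Convert in place to a plain data.frame before handing it to SparkR;
# setDF() changes the class without copying the data.
setDF(x)
xs <- createDataFrame(sqlContext, x)  # sqlContext as initialized in the question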


Until SPARK-8277 is resolved there is not much you can do, but there are two options you can try:



  • use a plain old data.frame instead of data.table. Using the flights dataset (227,496 rows, 14 columns):

df <- read.csv("flights.csv")
microbenchmark::microbenchmark(createDataFrame(sqlContext, df), times=3)

## Unit: seconds
##                             expr      min       lq     mean   median
##  createDataFrame(sqlContext, df) 96.41565 97.19515 99.08441 97.97465
##        uq      max neval
##  100.4188 102.8629     3

compared to data.table:

dt <- data.table::fread("flights.csv")
microbenchmark::microbenchmark(createDataFrame(sqlContext, dt), times=3)

## Unit: seconds        
##                             expr      min       lq     mean  median
##  createDataFrame(sqlContext, dt) 378.8534 379.4482 381.2061 380.043
##        uq     max neval
##  382.3825 384.722     3



  • Write to disk and use spark-csv to load data directly to Spark DataFrame without direct interaction with R. As crazy as it sounds:

    dt <- data.table::fread("flights.csv")
    
    write_and_read <- function() {
        # Write to a temporary CSV and read the same file back with spark-csv,
        # so Spark infers the schema instead of serializing the data from R.
        path <- tempfile(fileext = ".csv")
        write.csv(dt, path, row.names = FALSE)
        read.df(sqlContext, path,
            source = "com.databricks.spark.csv",
            header = "true",
            inferSchema = "true"
        )
    }
    
    microbenchmark::microbenchmark(write_and_read(), times = 3)
    
    ## Unit: seconds
    ##              expr      min       lq     mean   median
    ##  write_and_read() 2.924142 2.959085 2.983008 2.994027
    ##       uq      max neval
    ##  3.01244 3.030854     3
    
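    A quick check of the result (my addition; printSchema() and count() are standard SparkR DataFrame operations):
    
    sdf <- write_and_read()
    printSchema(sdf)  # schema inferred by spark-csv
    count(sdf)        # row count; should match nrow(dt), i.e. 227496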
    


    I am not really sure if it really makes sense to push data that can be handled in R to Spark in the first place, but let's not dwell on that.

EDIT

This issue should be resolved by SPARK-11086 in Spark 1.6.0.
