Is sample_n really a random sample when used with sparklyr?

Question

I have 500 million rows in a Spark dataframe. I'm interested in using sample_n from dplyr because it lets me explicitly specify the sample size I want. If I were to use sparklyr::sdf_sample(), I would first have to call sdf_nrow() to count the rows, then compute the fraction sample_size / nrow, and then pass that fraction to sdf_sample(). This isn't a big deal, but sdf_nrow() can take a while to complete.
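A minimal sketch of that two-step workaround, assuming the spark_data table below and a target sample size of 300 (the seed and argument names are illustrative, and fraction-based sampling returns approximately, not exactly, that many rows):

library(sparklyr)
library(dplyr)

sample_size <- 300

# The expensive step on 500 million rows: a full count.
n_total <- sdf_nrow(spark_data)

# Sample by the derived fraction; the resulting row count is approximate.
spark_sample <- spark_data %>%
  sdf_sample(fraction = sample_size / n_total,
             replacement = FALSE,
             seed = 42)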

So, it would be ideal to use dplyr::sample_n() directly. However, after some testing, it doesn't look like sample_n() is random. In fact, the results are identical to head()! It would be a major issue if instead of sampling rows at random, the function were just returning the first n rows.

Can anyone else confirm this? Is sdf_sample() my best option?

# install.packages("gapminder")

library(gapminder)
library(dplyr)      # sample_n(), sample_frac(), summarise()
library(sparklyr)
library(purrr)      # flatten_dbl()

sc <- spark_connect(master = "yarn-client")

# Copy the gapminder data frame into Spark as the table "gapminder"
spark_data <- sdf_import(gapminder, sc, "gapminder")


> # Appears to be random
> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    58.83397


> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    60.31693


> spark_data %>% sdf_sample(fraction = 0.20, replace = FALSE) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    59.38692
> 
> 
> # Appears to be random
> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    60.48903


> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    59.44187


> spark_data %>% sample_frac(0.20) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    59.27986
> 
> 
> # Does not appear to be random
> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    57.78434


> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    57.78434


> spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp))
# Source:   lazy query [?? x 1]
# Database: spark_connection
  sample_mean
        <dbl>
1    57.78434
> 
> 
> 
> # === Test sample_n() ===
> sample_mean <- list()
> 
> for(i in 1:20){
+   
+   sample_mean[i] <- spark_data %>% sample_n(300) %>% summarise(sample_mean = mean(lifeExp)) %>% collect() %>% pull()
+   
+ }
> 
> 
> sample_mean %>% flatten_dbl() %>% mean()
[1] 57.78434
> sample_mean %>% flatten_dbl() %>% sd()
[1] 0
> 
> 
> # === Test head() ===
> spark_data %>% 
+   head(300) %>% 
+   pull(lifeExp) %>% 
+   mean()
[1] 57.78434

Answer

It is not. If you check the execution plan (using the optimizedPlan function as defined here), you'll see it is just a limit:
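The helper referenced in parentheses can be sketched roughly as follows, using sparklyr's spark_dataframe() and invoke() to reach the underlying JVM objects (the exact definition in the linked answer may differ):

optimizedPlan <- function(df) {
  df %>%
    spark_dataframe() %>%        # underlying org.apache.spark.sql.Dataset
    invoke("queryExecution") %>% # its QueryExecution
    invoke("optimizedPlan")      # the optimized logical plan
}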

spark_data %>% sample_n(300) %>% optimizedPlan()

<jobj[168]>
  org.apache.spark.sql.catalyst.plans.logical.GlobalLimit
  GlobalLimit 300
+- LocalLimit 300
   +- InMemoryRelation [country#151, continent#152, year#153, lifeExp#154, pop#155, gdpPercap#156], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `gapminder`
         +- Scan ExistingRDD[country#151,continent#152,year#153,lifeExp#154,pop#155,gdpPercap#156] 

This is further confirmed by show_query():

spark_data %>% sample_n(300) %>% show_query()

<SQL>
SELECT *
FROM (SELECT *
FROM `gapminder` TABLESAMPLE (300 rows) ) `hntcybtgns`

and by the visualized execution plan.

Finally, if you check the Spark source, you'll see that this case is implemented with a simple LIMIT:

case ctx: SampleByRowsContext =>
  Limit(expression(ctx.expression), query)

I believe these semantics have been inherited from Hive, where the equivalent query takes the first n rows from each input split.

In practice, getting a sample of an exact size is just very expensive, and you should avoid it unless strictly necessary (the same goes for large LIMITs).
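For contrast, running the same kind of plan check against sdf_sample() (using the optimizedPlan helper sketched above; exact operator names may vary by Spark version) should show a Sample node rather than a GlobalLimit, which is why it is the better option here:

spark_data %>%
  sdf_sample(fraction = 0.20, replacement = FALSE, seed = 42) %>%
  optimizedPlan()
# Expected: a plan rooted at a Sample operator (random, without replacement)
# instead of the GlobalLimit 300 / LocalLimit 300 pair shown for sample_n().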
