dplyr: Integer sampling within mutate


Problem description


I am trying to generate a column in a tbl_df that is a random integer of 0 or 1. This is the code I am using:

library(dplyr)
set.seed(0)

#Dummy data.frame to test
df <- tbl_df(data.frame(x = rep(1:3, each = 4)))

#Generate the random integer column
df_test = df %>% 
  mutate(pop=sample(0:1, 1, replace=TRUE))

But this does not seem to work the way I expected. The field I generated seems to be all zeros. Is this because the statement within mutate is evaluated in parallel and hence ends up using the same seed for the first random draw?

df_test 
Source: local data frame [12 x 2]

   x pop
1  1   0
2  1   0
3  1   0
4  1   0
5  2   0
6  2   0
7  2   0
8  2   0
9  3   0
10 3   0
11 3   0
12 3   0

I have been racking my brain over this for the past few hours. Any idea what the flaw in my script is?

Solution

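The way your code is written, sample(0:1, 1, replace = TRUE) draws a single value, and mutate() assigns that one value to every row of the column (this is called "vector recycling"). The expression is evaluated once for the whole column, not once per row, so neither parallel evaluation nor the seed is the cause.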
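The recycling can be checked directly. The following is a minimal sketch, assuming dplyr is installed; the names d, draw and recycled are made up for illustration and do not come from the question. It shows that the draw has length one and that the resulting column holds a single distinct value, whichever value the seed happens to produce.

```r
library(dplyr)

set.seed(0)

# A single draw is a length-1 vector
draw <- sample(0:1, 1, replace = TRUE)
length(draw)

# Hypothetical 12-row data frame mirroring the one in the question
d <- data.frame(x = rep(1:3, each = 4))

# mutate() evaluates the expression once and recycles the length-1 result
recycled <- d %>% mutate(pop = sample(0:1, 1, replace = TRUE))

# Number of distinct values in the new column
n_distinct(recycled$pop)
```

Changing the seed may change which value fills the column, but the column stays constant as long as only one value is drawn.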

The best solution in this case is Steven Beaupré's answer: draw a random vector with one element per row of your data.frame by passing n() as the sample size:

df %>% 
  mutate(pop = sample(0:1, n(), replace = TRUE))


Generally, if you want to apply a function row-by-row in dplyr - as you thought would happen here - you can use rowwise(), though in this example it's not required.

Here's an example of rowwise():

df2 <- data.frame(a = c(1,3,6), b = c(2,4,5))

df2 %>%
  mutate(m = max(a,b))

  a b m
1 1 2 6
2 3 4 6
3 6 5 6

df2 %>%
  rowwise() %>%
  mutate(m = max(a,b))

  a b m
1 1 2 2
2 3 4 4
3 6 5 6

Since rowwise() groups the data by row, operations are potentially slower than without any grouping. It is therefore usually better to use vectorized functions whenever possible instead of operating row by row.
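For the max() example above, one vectorized alternative is base R's pmax(), which takes the element-wise maximum of its arguments. This is a sketch of that swap rather than part of the original answer; the names vectorised and by_row are illustrative.

```r
library(dplyr)

df2 <- data.frame(a = c(1, 3, 6), b = c(2, 4, 5))

# pmax() compares a and b element by element, so no rowwise() is needed
vectorised <- df2 %>% mutate(m = pmax(a, b))

# The same computation done row by row, for comparison
by_row <- df2 %>%
  rowwise() %>%
  mutate(m = max(a, b)) %>%
  ungroup()

# Both approaches give the same column
identical(vectorised$m, by_row$m)
```

When a vectorized counterpart like this exists, it avoids the per-row grouping overhead that the benchmark measures.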


Benchmarking:

The approach with rowwise() is roughly 35x slower in the run below (about 34x at the median, about 37x on average):

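library(microbenchmark)
library(ggplot2)  # provides the autoplot() generic used below

df <- tbl_df(data.frame(x = rep(1:1000, each = 4)))

# Compare one vectorized draw of n() values against one draw per row
bench <- microbenchmark(
  vectorized = df2 <- df %>% mutate(pop = sample(0:1, n(), replace = TRUE)),
  rowwise = df2 <- df %>% rowwise() %>% mutate(pop = sample(0:1, 1, replace = TRUE)),
  times = 1000
  )

# Report timings relative to the fastest expression
options(microbenchmark.unit="relative")
print(bench)
autoplot(bench)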

Unit: relative
       expr      min       lq     mean   median       uq     max neval
 vectorized  1.00000  1.00000  1.00000  1.00000  1.00000  1.0000  1000
    rowwise 42.53169 42.29486 36.94876 33.70456 34.92621 71.7682  1000
