Splitting data into chunks and iterating over each chunk in R

Question

I have a dataframe structured like this:

birthwt  tobacco01  pscore  pscoreblocks     blocknumber
3425     0          0.18    (0.177, 0.187]   1
3527     1          0.15    (0.158, 0.168]   2
1638     1          0.34    (0.335, 0.345]   3

Explaining the data: The birthwt column is a continuous variable measuring birth weight in grams. The tobacco01 column contains values of 0 or 1. The pscore column contains probability values between 0 and 1. The pscoreblocks column breaks the pscore column down into 100 equally sized blocks. The blocknumber column gives each block a number, so it runs from 1 to 100.
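The question does not show how pscoreblocks and blocknumber were built. One plausible way, sketched here on made-up scores, is cut() with 101 equally spaced break points; the simulated pscore vector and the breaks variable are assumptions for illustration, not the asker's code.

```r
# Hypothetical propensity scores; the real data has one score per birth
set.seed(1)
pscore <- runif(500)

# 101 equally spaced break points define 100 equal-width intervals
breaks <- seq(min(pscore), max(pscore), length.out = 101)

# include.lowest = TRUE closes the first interval so the minimum is not dropped
pscoreblocks <- cut(pscore, breaks = breaks, include.lowest = TRUE)

# The integer code of each factor level serves as the block number
blocknumber <- as.integer(pscoreblocks)

nlevels(pscoreblocks)
range(blocknumber)
```

Note that cut() creates all 100 levels even when some intervals contain no observations, so a sample of the data can easily skip block numbers.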

I am trying to do the following for each of the blocks in pscoreblocks.

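apply_model <- function(data) {
   # Regress birth weight on the smoking indicator within this block
   one <- lm(birthwt ~ tobacco01, data = data)
   # Slope on tobacco01 (NA when every row in the block has the same tobacco01 value)
   two <- one$coefficients[[2]]
   # Block size divided by the number of smokers; the column name is written in full
   # because tibbles do not partial-match data$tobacco to tobacco01
   two_5 <- (sum(data$tobacco01 == 1) + sum(data$tobacco01 == 0)) / sum(data$tobacco01)
   three <- two * two_5
   return(three)
}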

Method 1: One way of doing this (inefficiently) would be to use filter to create a separate dataframe for each block.

data1 <- data %>% filter(blocknumber == 1) 

I could then manually run the function above on each block.

Method 2: However, I would like to be able to run this efficiently for 100 blocks.

A solution has been suggested for this, and I get the same results from it as when I use this:

lapply(split(data, data$blocknumber), apply_model)

Question:

When I compare the value from Method 1 with the value from Method 2, I expect them to match. But if I filter the data down to block number 1 and run the analysis, the result differs from the value labeled (1) in the second method. Why am I not getting the same value here?

More generally, how do I split the data into chunks based on a column value and then iterate to run a function that involves a term that refers to the dataframe being used?

Reproducible example:

> small <- dput(dfcsmall[1:40,])
structure(list(birthwt = c(3629, 3005, 3459, 4520, 3095.17811313023, 
3714, 3515, 3232, 3686, 4281, 2645.29691556227, 3714, 3232, 3374, 
3856, 3997, 3515, 3714, 3459, 3232, 3884, 3235, 3008.94507753983, 
3799, 2940, 3389.51332290472, 3090, 1701, 3363, 3033, 2325, 3941, 
3657, 3600, 3005, 4054, 3856, 3402, 2694.09822203382, 3413.03869100037
), tobacco01 = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 1, 1), pscore = c(0.00988756408875347, 0.183983728674846, 
0.24538311074894, 0.170701594663405, 0.179337494008595, 0.0770304781540708, 
0.164003166666384, 0.0773042518100593, 0.0804603038634144, 0.0611822720382283, 
0.481204657069376, 0.166016137665693, 0.107882394783232, 0.149799473798458, 
0.04130366288307, 0.0360272679038012, 0.476513676221723, 0.214910849480014, 
0.0687582392973688, 0.317662260996216, 0.206183065905609, 0.336553699970873, 
0.0559863953956171, 0.103064791185442, 0.0445362319933672, 0.17097032928289, 
0.245898950803051, 0.146235179401833, 0.284345485401689, 0.152121397241563, 
0.0395696572471225, 0.116669642645446, 0.0672219220193578, 0.297173652687617, 
0.436771917147971, 0.0517299620576624, 0.140760280612358, 0.179726730598874, 
0.0118610298424373, 0.162996197785343), pscoreblocks = structure(c(1L, 
19L, 25L, 18L, 19L, 8L, 17L, 8L, 9L, 7L, 49L, 17L, 11L, 16L, 
5L, 4L, 49L, 22L, 7L, 33L, 21L, 35L, 6L, 11L, 5L, 18L, 25L, 15L, 
29L, 16L, 5L, 12L, 7L, 31L, 45L, 6L, 15L, 19L, 2L, 17L), .Label = c("[3.88e-05,0.0099]", 
"(0.0099,0.0198]", "(0.0198,0.0296]", "(0.0296,0.0395]", "(0.0395,0.0493]", 
"(0.0493,0.0592]", "(0.0592,0.069]", "(0.069,0.0789]", "(0.0789,0.0888]", 
"(0.0888,0.0986]", "(0.0986,0.108]", "(0.108,0.118]", "(0.118,0.128]", 
"(0.128,0.138]", "(0.138,0.148]", "(0.148,0.158]", "(0.158,0.168]", 
"(0.168,0.177]", "(0.177,0.187]", "(0.187,0.197]", "(0.197,0.207]", 
"(0.207,0.217]", "(0.217,0.227]", "(0.227,0.237]", "(0.237,0.246]", 
"(0.246,0.256]", "(0.256,0.266]", "(0.266,0.276]", "(0.276,0.286]", 
"(0.286,0.296]", "(0.296,0.306]", "(0.306,0.315]", "(0.315,0.325]", 
"(0.325,0.335]", "(0.335,0.345]", "(0.345,0.355]", "(0.355,0.365]", 
"(0.365,0.375]", "(0.375,0.384]", "(0.384,0.394]", "(0.394,0.404]", 
"(0.404,0.414]", "(0.414,0.424]", "(0.424,0.434]", "(0.434,0.444]", 
"(0.444,0.453]", "(0.453,0.463]", "(0.463,0.473]", "(0.473,0.483]", 
"(0.483,0.493]", "(0.493,0.503]", "(0.503,0.513]", "(0.513,0.522]", 
"(0.522,0.532]", "(0.532,0.542]", "(0.542,0.552]", "(0.552,0.562]", 
"(0.562,0.572]", "(0.572,0.582]", "(0.582,0.591]", "(0.591,0.601]", 
"(0.601,0.611]", "(0.611,0.621]", "(0.621,0.631]", "(0.631,0.641]", 
"(0.641,0.651]", "(0.651,0.66]", "(0.66,0.67]", "(0.67,0.68]", 
"(0.68,0.69]", "(0.69,0.7]", "(0.7,0.71]", "(0.71,0.72]", "(0.72,0.73]", 
"(0.73,0.739]", "(0.739,0.749]", "(0.749,0.759]", "(0.759,0.769]", 
"(0.769,0.779]", "(0.779,0.789]", "(0.789,0.799]", "(0.799,0.808]", 
"(0.808,0.818]", "(0.818,0.828]", "(0.828,0.838]", "(0.838,0.848]", 
"(0.848,0.858]", "(0.858,0.868]", "(0.868,0.877]", "(0.877,0.887]", 
"(0.887,0.897]", "(0.897,0.907]", "(0.907,0.917]", "(0.917,0.927]", 
"(0.927,0.937]", "(0.937,0.946]", "(0.946,0.956]", "(0.956,0.966]", 
"(0.966,0.976]", "(0.976,0.986]"), class = "factor"), blocknumber = c(1L, 
19L, 25L, 18L, 19L, 8L, 17L, 8L, 9L, 7L, 49L, 17L, 11L, 16L, 
5L, 4L, 49L, 22L, 7L, 33L, 21L, 35L, 6L, 11L, 5L, 18L, 25L, 15L, 
29L, 16L, 5L, 12L, 7L, 31L, 45L, 6L, 15L, 19L, 2L, 17L)), row.names = c(NA, 
-40L), class = c("tbl_df", "tbl", "data.frame"))

Recommended answer

They are giving equivalent results, but I believe you're indexing based on position instead of name:

library(dplyr)  # provides %>%, filter() and group_split()

data %>% filter(blocknumber == 6) %>% apply_model()
# [1] -2090

If we then try to index the list of models by position 6, it's not equal:

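library(purrr)  # provides map() and map_chr()

# One list element per block that occurs in the data; the list is unnamed
data_split <- data %>% group_split(blocknumber)
models <- data_split %>% map(apply_model)
models[[6]]
# [1] NA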

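But that's because data_split[[6]] is not the same as data %>% filter(blocknumber == 6). group_split() returns an unnamed list with one element per block that actually occurs in the data, so the sixth element is the sixth block present (here block 7, since block 3 does not occur in the example), not block 6: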

data_split[[6]]
# # A tibble: 3 x 5
#   birthwt tobacco01 pscore pscoreblocks   blocknumber
#     <dbl>     <dbl>  <dbl> <fct>                <int>
# 1    4281         0 0.0612 (0.0592,0.069]           7
# 2    3459         0 0.0688 (0.0592,0.069]           7
# 3    3657         0 0.0672 (0.0592,0.069]           7
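To check which block sits at each list position, one option is to pull the block number out of every piece. The sketch below uses a small made-up data frame (toy, with block 3 deliberately missing) rather than the data from the question; the names toy, toy_split and positions are illustrative only.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: block 3 never occurs
toy <- data.frame(
  blocknumber = c(1L, 1L, 2L, 4L, 4L, 5L),
  birthwt     = c(3400, 3100, 3300, 3600, 2900, 3200)
)

# One list element per block present, in sorted order, without names
toy_split <- toy %>% group_split(blocknumber)

# Block number stored at each list position
positions <- toy_split %>% map_int(~ .x$blocknumber[1])
positions

# Position 3 holds block 4, so position and block number diverge from here on
unique(toy_split[[3]]$blocknumber)
```

Any block missing from the data shifts every later position, which is why a lookup by position only matches the block number by coincidence.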

You can fix this by assigning names and then indexing by name:

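# Name each result after its block; the explicit as.character() avoids relying on
# map_chr() coercing integers, which newer purrr versions warn about
names(models) <- data_split %>% map("blocknumber") %>% map_chr(~ as.character(unique(.x)))
models[["6"]]
# [1] -2090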

base::split also preserves names by default so I generally prefer to use it:

models <- data %>% split(.$blocknumber) %>% map(apply_model) 
models[["6"]]
# [1] -2090
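The same idea also works without dplyr or purrr. The following is a minimal base R sketch on made-up data, assuming tobacco01 only contains 0 and 1 with no missing values, so the block size can be taken from nrow(); the toy data frame and the estimates vector are illustrative names, and apply_model is restated here only to keep the example self-contained.

```r
# Same calculation as apply_model in the question, assuming tobacco01 is 0/1 with no NAs
apply_model <- function(data) {
  fit <- lm(birthwt ~ tobacco01, data = data)
  slope <- coef(fit)[["tobacco01"]]
  # Slope scaled by block size over number of smokers
  slope * nrow(data) / sum(data$tobacco01)
}

# Hypothetical toy data with only blocks 2 and 5 present
toy <- data.frame(
  blocknumber = c(2L, 2L, 2L, 5L, 5L, 5L, 5L),
  tobacco01   = c(0, 1, 0, 0, 0, 1, 1),
  birthwt     = c(3500, 3000, 3300, 3600, 3400, 3100, 2900)
)

# split() names each piece after its block; vapply() keeps those names
estimates <- vapply(split(toy, toy$blocknumber), apply_model, numeric(1))
names(estimates)

# Look up block 5 by name and compare with filtering that block by hand
by_name   <- estimates[["5"]]
by_filter <- apply_model(toy[toy$blocknumber == 5, ])
all.equal(by_name, by_filter)
```

vapply() returns a named numeric vector rather than a list, which makes it easy to turn the results into a table with one row per block if that is needed later.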
