Running iterated regressions for data divided into N chunks in R

Problem description

I have a dataframe structured like the following:

birthwt  tobacco01  pscore  pscoreblocks
3425     0          0.18    (0.177, 0.187]
3527     1          0.15    (0.158, 0.168]
1638     1          0.34    (0.335, 0.345]

The birthwt column is a continuous variable measuring birth weight in grams. The tobacco01 column contains values of 0 or 1. The pscore column contains probability values between 0 and 1. The pscoreblocks column takes the pscore column and breaks it down into 100 equally sized blocks.

I am trying to find an efficient way to do the following for each of the blocks in pscoreblocks. I have included the code that would work if I was running this on the entire dataset without partitioning into blocks.

1- Run a regression.

one <- lm(birthwt ~ tobacco01, dfc)

2- Take the value of the coefficient on the tobacco01 variable in the regression.

two <- summary(one)$coefficients[2,1]

3- Multiply that coefficient value by: [(the number of people for whom tobacco == 1 in that block) + (the number of people for whom tobacco == 0 in that block)] / (the total number of people in that block)

two_5 <- (sum(dfc$tobacco01 == 1) + sum(dfc$tobacco01 == 0)) / nrow(dfc)  # denominator: total number of people (rows)

three <- two*two_5

4- Finally, I would like to be able to add up all the values from (3) for all 100 blocks.

I know how to do each of these steps individually, but I don't know how to iterate them over 100 separate blocks. I tried using group_by(pscoreblocks) and then running a regression, but it looks like group_by() and lm() do not work well together. I have also considered using pivot_longer() to create a separate column for each block and then trying to run the regressions with the data in that format. I'd really appreciate any suggestions for how to iterate over all 100 blocks.

Data:

> small <- dput(dfcsmall[1:40,])
structure(list(dbrwt = c(3629, 3005, 3459, 4520, 3095.17811313023, 
3714, 3515, 3232, 3686, 4281, 2645.29691556227, 3714, 3232, 3374, 
3856, 3997, 3515, 3714, 3459, 3232, 3884, 3235, 3008.94507753983, 
3799, 2940, 3389.51332290472, 3090, 1701, 3363, 3033, 2325, 3941, 
3657, 3600, 3005, 4054, 3856, 3402, 2694.09822203382, 3413.03869100037
), tobacco01 = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 1, 1), pscore = c(0.00988756408875347, 0.183983728674846, 
0.24538311074894, 0.170701594663405, 0.179337494008595, 0.0770304781540708, 
0.164003166666384, 0.0773042518100593, 0.0804603038634144, 0.0611822720382283, 
0.481204657069376, 0.166016137665693, 0.107882394783232, 0.149799473798458, 
0.04130366288307, 0.0360272679038012, 0.476513676221723, 0.214910849480014, 
0.0687582392973688, 0.317662260996216, 0.206183065905609, 0.336553699970873, 
0.0559863953956171, 0.103064791185442, 0.0445362319933672, 0.17097032928289, 
0.245898950803051, 0.146235179401833, 0.284345485401689, 0.152121397241563, 
0.0395696572471225, 0.116669642645446, 0.0672219220193578, 0.297173652687617, 
0.436771917147971, 0.0517299620576624, 0.140760280612358, 0.179726730598874, 
0.0118610298424373, 0.162996197785343), pscoreblocks = structure(c(1L, 
19L, 25L, 18L, 19L, 8L, 17L, 8L, 9L, 7L, 49L, 17L, 11L, 16L, 
5L, 4L, 49L, 22L, 7L, 33L, 21L, 35L, 6L, 11L, 5L, 18L, 25L, 15L, 
29L, 16L, 5L, 12L, 7L, 31L, 45L, 6L, 15L, 19L, 2L, 17L), .Label = c("[3.88e-05,0.0099]", 
"(0.0099,0.0198]", "(0.0198,0.0296]", "(0.0296,0.0395]", "(0.0395,0.0493]", 
"(0.0493,0.0592]", "(0.0592,0.069]", "(0.069,0.0789]", "(0.0789,0.0888]", 
"(0.0888,0.0986]", "(0.0986,0.108]", "(0.108,0.118]", "(0.118,0.128]", 
"(0.128,0.138]", "(0.138,0.148]", "(0.148,0.158]", "(0.158,0.168]", 
"(0.168,0.177]", "(0.177,0.187]", "(0.187,0.197]", "(0.197,0.207]", 
"(0.207,0.217]", "(0.217,0.227]", "(0.227,0.237]", "(0.237,0.246]", 
"(0.246,0.256]", "(0.256,0.266]", "(0.266,0.276]", "(0.276,0.286]", 
"(0.286,0.296]", "(0.296,0.306]", "(0.306,0.315]", "(0.315,0.325]", 
"(0.325,0.335]", "(0.335,0.345]", "(0.345,0.355]", "(0.355,0.365]", 
"(0.365,0.375]", "(0.375,0.384]", "(0.384,0.394]", "(0.394,0.404]", 
"(0.404,0.414]", "(0.414,0.424]", "(0.424,0.434]", "(0.434,0.444]", 
"(0.444,0.453]", "(0.453,0.463]", "(0.463,0.473]", "(0.473,0.483]", 
"(0.483,0.493]", "(0.493,0.503]", "(0.503,0.513]", "(0.513,0.522]", 
"(0.522,0.532]", "(0.532,0.542]", "(0.542,0.552]", "(0.552,0.562]", 
"(0.562,0.572]", "(0.572,0.582]", "(0.582,0.591]", "(0.591,0.601]", 
"(0.601,0.611]", "(0.611,0.621]", "(0.621,0.631]", "(0.631,0.641]", 
"(0.641,0.651]", "(0.651,0.66]", "(0.66,0.67]", "(0.67,0.68]", 
"(0.68,0.69]", "(0.69,0.7]", "(0.7,0.71]", "(0.71,0.72]", "(0.72,0.73]", 
"(0.73,0.739]", "(0.739,0.749]", "(0.749,0.759]", "(0.759,0.769]", 
"(0.769,0.779]", "(0.779,0.789]", "(0.789,0.799]", "(0.799,0.808]", 
"(0.808,0.818]", "(0.818,0.828]", "(0.828,0.838]", "(0.838,0.848]", 
"(0.848,0.858]", "(0.858,0.868]", "(0.868,0.877]", "(0.877,0.887]", 
"(0.887,0.897]", "(0.897,0.907]", "(0.907,0.917]", "(0.917,0.927]", 
"(0.927,0.937]", "(0.937,0.946]", "(0.946,0.956]", "(0.956,0.966]", 
"(0.966,0.976]", "(0.976,0.986]"), class = "factor"), blocknumber = c(1L, 
19L, 25L, 18L, 19L, 8L, 17L, 8L, 9L, 7L, 49L, 17L, 11L, 16L, 
5L, 4L, 49L, 22L, 7L, 33L, 21L, 35L, 6L, 11L, 5L, 18L, 25L, 15L, 
29L, 16L, 5L, 12L, 7L, 31L, 45L, 6L, 15L, 19L, 2L, 17L)), row.names = c(NA, 
-40L), class = c("tbl_df", "tbl", "data.frame"))

Recommended answer

You could create a function to apply to each block in pscoreblocks.

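apply_model <- function(data) {
   # Step 1: regress birth weight on the tobacco indicator within this block
   one <- lm(birthwt ~ tobacco01, data)
   # Step 2: coefficient on tobacco01 (NA if tobacco01 does not vary in the block)
   two <- coef(one)[["tobacco01"]]
   # Step 3: (smokers + non-smokers) / total number of rows in the block
   two_5 <- (sum(data$tobacco01 == 1) + sum(data$tobacco01 == 0)) / nrow(data)
   three <- two * two_5
   return(three)
}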
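
To see what a per-block function of this kind returns, here is a minimal sketch on made-up data. The `toy` data frame and the name `block_effect` are illustrative assumptions, not part of the original post. It also shows an edge case that matters when iterating over many blocks: if a block contains only non-smokers (or only smokers), the tobacco01 coefficient cannot be estimated and `coef()` reports it as NA.

```r
# Toy data: two blocks; block "b" contains no smokers (hypothetical values).
toy <- data.frame(
  birthwt      = c(3400, 3100, 3500, 3000, 3600, 3300, 3450),
  tobacco01    = c(0, 1, 0, 1, 0, 0, 0),
  pscoreblocks = factor(c("a", "a", "a", "a", "b", "b", "b"))
)

# Illustrative per-block function following steps 1-3 of the question.
block_effect <- function(data) {
  fit   <- lm(birthwt ~ tobacco01, data = data)
  beta  <- coef(fit)[["tobacco01"]]   # NA when tobacco01 does not vary
  share <- (sum(data$tobacco01 == 1) + sum(data$tobacco01 == 0)) / nrow(data)
  beta * share
}

effect_a <- block_effect(toy[toy$pscoreblocks == "a", ])  # difference in group means
effect_b <- block_effect(toy[toy$pscoreblocks == "b", ])  # not estimable, so NA
```

In block "a" the coefficient is simply the smoker mean minus the non-smoker mean, so it is easy to check by hand. How to treat the NA blocks (drop them, merge them with neighbours, and so on) is an analysis decision the original post does not address.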

Split the data into separate data frames and apply this function to each chunk.

library(dplyr)
library(purrr)

dfc %>% group_split(pscoreblocks) %>% map(apply_model)
# Or, to get a numeric vector instead of a list:
#dfc %>% group_split(pscoreblocks) %>% map_dbl(apply_model)

You can also use base R:

lapply(split(dfc, dfc$pscoreblocks), apply_model)

Or use by:

by(dfc, dfc$pscoreblocks, apply_model)
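
Step 4 of the question (adding up the per-block values) is not shown above. The following is a minimal base R sketch under stated assumptions: the data are simulated, `dfc_sim` and `block_effect` are hypothetical stand-ins for dfc and apply_model, and blocks whose coefficient cannot be estimated are skipped in the sum via `na.rm = TRUE`, which may or may not be the right choice for a given analysis.

```r
set.seed(42)

# Simulated stand-in for dfc: 300 rows spread over 3 blocks (hypothetical data).
n <- 300
dfc_sim <- data.frame(
  tobacco01    = rbinom(n, 1, 0.3),
  pscoreblocks = factor(sample(c("low", "mid", "high"), n, replace = TRUE))
)
dfc_sim$birthwt <- 3400 - 250 * dfc_sim$tobacco01 + rnorm(n, sd = 400)

# Hypothetical per-block function following steps 1-3 of the question.
block_effect <- function(data) {
  fit <- lm(birthwt ~ tobacco01, data = data)
  coef(fit)[["tobacco01"]] *
    (sum(data$tobacco01 == 1) + sum(data$tobacco01 == 0)) / nrow(data)
}

# drop = TRUE omits factor levels with no rows, so lm() is never called
# on an empty data frame.
blocks <- split(dfc_sim, dfc_sim$pscoreblocks, drop = TRUE)

# One value per block, returned as a named numeric vector.
per_block <- vapply(blocks, block_effect, numeric(1))

# Step 4: add up the per-block values, skipping blocks with an NA estimate.
total <- sum(per_block, na.rm = TRUE)
```

`vapply()` is used instead of `lapply()` so the result is a plain numeric vector that can be passed straight to `sum()`; with the purrr version, `map_dbl()` plays the same role.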
