Loop over a subset, source a file and save results in a dataframe

Question

Similar questions have been asked already, but none solved my specific problem. I have an .R file ("Mycalculus.R") containing many basic calculations that I need to apply to subsets of a dataframe: one subset per year, where the values of "year" are factor levels (yearA, yearB, yearC), not numeric values. The file generates a new dataframe that I need to save in an Rda file. Here is what I expect the code to look like with a for loop (this one obviously does not work):

id <- identif(unlist(df$year))
for (i in 1:length(id)){
    data <- subset(df, year == id[i])
    source ("Mycalculus.R", echo=TRUE)
    save(content_df1,file="myresults.Rda")
}

Here is an extract of the main data.frame df:

obs    year    income    gender   ageclass    weight
 1     yearA    1000       F         1          10
 2     yearA    1200       M         2          25
 3     yearB    1400       M         2           5
 4     yearB    1350       M         1          11

Here is what the sourced file "Mycalculus.R" does: it applies numerous basic calculations to columns of the dataframe called "data", and creates two new dataframes, df1 and then df2 based on df1. Here is an extract:

data <- data %>% 
   group_by(gender) %>% 
   mutate(Income_gender = weighted.mean(income, weight))
data <- data %>% 
   group_by(ageclass) %>% 
   mutate(Income_ageclass = weighted.mean(income, weight))

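library(GiniWegNeg)
# weight is a column of data, so it has to be referenced as data$weight here;
# unlist() is a precaution in case Gini_RSV() returns a list
gini <- unlist(c(Gini_RSV(data$Income_gender, data$weight),
                 Gini_RSV(data$Income_ageclass, data$weight)))

# one row, two columns
df1 <- data.frame(Income_gender = gini[[1]], Income_ageclass = gini[[2]])
rownames(df1) <- c("content_df1")

# df2 is derived from df1 only
df2 <- data.frame(myresult = (1/5) * df1$Income_gender + df1$Income_ageclass)
rownames(df2) <- c("content_df2")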

So that in the end, I get two dataframes like this:

                    Income_Gender  Income_Ageclass    
content_df1           ....             ....     

And for df2:

                    myresult      
content_df2           ....          

But I need to save df1 and df2 in an Rda file where the row names content_df1 and content_df2 are given per subset, something like this:

                    Income_Gender  Income_Ageclass    
content_df1_yearA     ....             ....     
content_df1_yearB     ....             ....     
content_df1_yearC     ....             ....     

                    myresult
content_df2_yearA     ....   
content_df2_yearB     ....    
content_df2_yearC     ....   

Currently, my program does not use any loop and does the job, but messily: the code is more than 2500 lines long. (Please don't throw tomatoes at me.)

Could anyone help me with this specific request? Thank you in advance.

Recommended answer

Consider putting everything in one script, with the calculations wrapped in functions that take the needed arguments and are called by lapply(). lapply() then returns a list of dataframes that you can row-bind into one final dataframe.

library(dplyr)
library(GiniWegNeg)

# Gini coefficients for one year's subset; the row name carries the year
runIncomeCalc <- function(data, y){
  data <- data %>%
    group_by(gender) %>%
    mutate(Income_gender = weighted.mean(income, weight)) %>%
    group_by(ageclass) %>%
    mutate(Income_ageclass = weighted.mean(income, weight)) %>%
    ungroup()

  # weight must be referenced as data$weight outside mutate();
  # unlist() is a precaution in case Gini_RSV() returns a list
  gini <- unlist(c(Gini_RSV(data$Income_gender, data$weight),
                   Gini_RSV(data$Income_ageclass, data$weight)))

  # one row, two columns
  df1 <- data.frame(Income_gender = gini[[1]], Income_ageclass = gini[[2]])
  rownames(df1) <- paste0("content_df1_", y)

  return(df1)
}

# Derived result for one year, built from the one-row output of runIncomeCalc()
runResultsCalc <- function(df, y){
  df2 <- data.frame(myresult = (1/5) * df$Income_gender + df$Income_ageclass)
  rownames(df2) <- paste0("content_df2_", y)

  return(df2)
}

dfIncList <- lapply(unique(df$year), function(i) {
  yeardata <- subset(df, year == i)
  runIncomeCalc(yeardata, i)
})

dfResList <- lapply(unique(df$year), function(i) {
  yeardata <- subset(df, year == i)
  df <- runIncomeCalc(yeardata, i)
  runResultsCalc(df, i)
})

# Stack the one-row results into one dataframe each
df1 <- do.call(rbind, dfIncList)
df2 <- do.call(rbind, dfResList)
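
The pattern itself (one function call per year through lapply(), then do.call(rbind, ...)) can be tried without GiniWegNeg or dplyr. The following is only a minimal base R sketch: the data is the four-row extract from the question, and summarise_year() is a hypothetical stand-in that computes a weighted mean income instead of the real Gini calculation.

```r
# Sample data copied from the extract of df shown in the question
df <- data.frame(
  obs      = 1:4,
  year     = factor(c("yearA", "yearA", "yearB", "yearB")),
  income   = c(1000, 1200, 1400, 1350),
  gender   = c("F", "M", "M", "M"),
  ageclass = c(1, 2, 2, 1),
  weight   = c(10, 25, 5, 11)
)

# Hypothetical stand-in for the real per-year calculation:
# returns a one-row dataframe whose row name carries the year
summarise_year <- function(data, y) {
  data.frame(
    mean_income = weighted.mean(data$income, data$weight),
    row.names   = paste0("content_df1_", y)
  )
}

# Convert the factor levels to plain strings before looping
years <- as.character(unique(df$year))

# One one-row dataframe per year, then stacked into a single dataframe
res_list <- lapply(years, function(y) summarise_year(subset(df, year == y), y))
res <- do.call(rbind, res_list)
print(res)
```

Because each call returns a dataframe with its own row name, rbind() keeps the per-year labels, which is the layout the question asks for.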


If you need to source across scripts, define the same two functions, runIncomeCalc and runResultsCalc, in Mycalculus.R and then call them from the other script:

library(dplyr)
library(GiniWegNeg)

if(!exists("runIncomeCalc", mode="function")) source("Mycalculus.R")

dfIncList <- lapply(unique(df$year), function(i) {      
  yeardata <- subset(df, year == i)
  runIncomeCalc(yeardata, i)      
})

dfResList <- lapply(unique(df$year), function(i) {      
  yeardata <- subset(df, year == i)
  df <- runIncomeCalc(yeardata, i) 
  runResultsCalc(df, i)      
})

df1 <- do.call(rbind, dfIncList)
df2 <- do.call(rbind, dfResList)
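
The question also asks for the results to end up in an Rda file, which the code above does not do. A minimal sketch, assuming both dataframes should go into a single file: save() accepts several objects at once, and load() restores them under their original names. The numbers below are placeholders, not real Gini values, and the file is written to a temporary directory only to keep the example self-contained.

```r
# Placeholder results standing in for the df1 and df2 built by the lapply() calls
df1 <- data.frame(
  Income_gender   = c(0.10, 0.20),
  Income_ageclass = c(0.30, 0.40),
  row.names       = c("content_df1_yearA", "content_df1_yearB")
)
df2 <- data.frame(
  myresult  = (1/5) * df1$Income_gender + df1$Income_ageclass,
  row.names = c("content_df2_yearA", "content_df2_yearB")
)

# Write both objects once, after the loop, so nothing is overwritten per year
rda_path <- file.path(tempdir(), "myresults.Rda")
save(df1, df2, file = rda_path)

# Load into a separate environment to check what the file contains
restored <- new.env()
load(rda_path, envir = restored)
print(ls(restored))
```

Saving once after the results are combined avoids the problem in the question's for loop, where each iteration would overwrite "myresults.Rda" with the latest year only.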
