Combine a series of data frames and create new columns for data in each

Problem description

I have an Excel file with a sheet for each week in my data set. Each sheet has the same number of rows, and each row is identical across the sheets (with the exception of the time period… sheet 1 represents week 1, sheet 2 week 2, etc.). I'm trying to import all the Excel sheets as one data frame in R.

For example, my data is essentially structured like this (with several more columns and sheets):

Week 1 sheet
ID    Gender    DOB    Absences    Lates    Absences_excused
1     M         1997   5           14       5
2     F         1998   4           3        2

Week 2 sheet
ID    Gender    DOB    Absences    Lates    Absences_excused
1     M         1997   2           10       3
2     F         1998   8           2        9

I'm trying to build a script that will take any number of sheets and combine them into one data frame like this:

Combined (ideal)
ID    Gender    DOB    Absences.1    Lates.1    Absences.2    Lates.2
1     M         1997   5             14         2             10
2     F         1998   4             3          8             2

I'm using gdata to import the Excel files.

I've tried creating a loop (normally bad for R, I know...) that will go through all the sheets in the Excel file and add each to a list as a data frame:

library(gdata)

number_sheets <- 3
all.sheets <- vector(mode="list", length=number_sheets)

for (i in 1:number_sheets) {
  all.sheets[[i]] <- read.xls("/path/to/file.xlsx", sheet=i)
}

This gives me a nice list, all.sheets, that I can access, but I'm unsure about the best way to create a new data frame from specific columns in the list of data frames.

I've tried the code below, which creates a brand new data frame by looping through the list of data frames. On the first data frame, it saves the columns that are consistent in all the sheets, and then adds the week-specific columns.

Cleaned <- data.frame()
number_sheets <- 3

for (i in 1:number_sheets) {
  if (i == 1) {
    Cleaned <- all.sheets[[i]][,c("ID", "Gender", "DOB")]
  }
  Cleaned$Absences.i <- all.sheets[[i]][,c("Absences")]  # wrong... obviously doesn't work... but essentially what I want
  # Other week-specific columns go here... somehow...
}

This code doesn't work though, since Cleaned$Absences.i is obviously not how you create dynamic columns in a data frame.

What's the best way to combine a set of data frames and create new columns for each of the variables I'm trying to track?

Extra hurdle: I'm also trying to combine two columns, "Absences" and "Absences_excused" into a single "Absences" column in the final data frame, so I'm trying to make my solution let me perform transformations to the new columns, like so (again, this isn't right):

Cleaned$Absences.i <- all.sheets[[i]][,c("Absences")] + all.sheets[[i]][,c("Absences_excused")]  


Recommended answer

@DWin I think the poster's problem is a little more complex than the example leads us to believe. I think the poster wants a multi-way merge, as indicated by "week 1, sheet 2 week 2, etc.". My approach is a bit different. The extra hurdle can be taken care of before the merge using lapply with transform. Here's my solution for the merge using 3 data frames instead of 2.

#First read in three data frames
Week_1_sheet <- read.table(text="ID Gender  DOB Absences Unexcused_Absences Lates
1  1      M 1997        5                  1    14
2  2      F 1998        4                  2     3", header=TRUE)

Week_2_sheet <- read.table(text="ID Gender  DOB Absences Unexcused_Absences Lates
1  1      M 1997        2                  1    10
2  2      F 1998        8                  2     2
3  3      M 1998        8                  2     2", header=TRUE)

Week_3_sheet <- read.table(text="ID Gender  DOB Absences Unexcused_Absences Lates
1  1      M 1997        2                  1    10
2  2      F 1998        8                  2     2", header=TRUE)

#Put them into a list structure
WEEKlist <- list(Week_1_sheet , Week_2_sheet , Week_3_sheet)

#Transform to add absences and unexcused absences row by row, then drop unexcused
#(`+` is used rather than sum(), which would collapse both columns into one total)
WEEKlist <- lapply(WEEKlist, function(df) {
    df <- transform(df, Absences = Absences + Unexcused_Absences)
    df[, names(df) != "Unexcused_Absences"]
})
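
A note on the arithmetic inside transform: sum() returns a single total over every row of both columns, whereas `+` adds the columns row by row. A small sketch on made-up data (the `sheet` frame and its values are hypothetical) shows the difference:

```r
# Hypothetical two-row sheet with the same column names as the sample data
sheet <- data.frame(ID = 1:2, Absences = c(5, 4), Unexcused_Absences = c(1, 3))

# sum() adds every value in both columns; the single total is recycled to all rows
collapsed <- transform(sheet, Absences = sum(Absences, Unexcused_Absences))

# `+` adds the two columns element by element, giving one result per row
row_wise <- transform(sheet, Absences = Absences + Unexcused_Absences)

collapsed$Absences  # 13 13
row_wise$Absences   # 6 7
```

For a per-student total, the row-wise form is the one that matches the "extra hurdle" in the question.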

#Rename the week-specific columns of each data frame by appending the week number
WEEKlist <- lapply(seq_along(WEEKlist), function(x) {
    df <- WEEKlist[[x]]
    y <- names(df)
    names(df) <- c(y[1:3], paste(y[4:length(y)], ".", x, sep=""))
    df
})
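
To see what the renaming step produces, here is the same paste() expression applied to a plain vector of column names. The `week` variable and the `new_names` result are illustrative names, not part of the answer's code:

```r
# Column names of one weekly sheet after the unexcused column has been dropped
y <- c("ID", "Gender", "DOB", "Absences", "Lates")
week <- 2  # hypothetical position of the sheet in the list

# Keep the three key columns as they are; append ".<week>" to the rest
new_names <- c(y[1:3], paste(y[4:length(y)], ".", week, sep = ""))
new_names
# "ID" "Gender" "DOB" "Absences.2" "Lates.2"
```

Leaving the key columns unsuffixed is what lets merge match on them later.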

#loop through and merge by the common columns
DF <- WEEKlist[[1]][, 1:3]
for (.df in WEEKlist) { 
     DF <-merge(DF, .df, by=c('ID', 'Gender', 'DOB'), all=TRUE, suffixes=c("", ""))
}

DF
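
As an aside on the question's own loop: a dynamic column name can be built as a string and assigned with `[[` instead of `$`. This is a minimal sketch, assuming every sheet lists the same IDs in the same order (as the question states). Here `all.sheets` is built by hand from the question's sample tables rather than read from Excel:

```r
# Stand-in for the list produced by read.xls in the question
all.sheets <- list(
  data.frame(ID = 1:2, Gender = c("M", "F"), DOB = c(1997, 1998),
             Absences = c(5, 4), Lates = c(14, 3), Absences_excused = c(5, 2)),
  data.frame(ID = 1:2, Gender = c("M", "F"), DOB = c(1997, 1998),
             Absences = c(2, 8), Lates = c(10, 2), Absences_excused = c(3, 9))
)

# Start from the columns shared by every sheet
Cleaned <- all.sheets[[1]][, c("ID", "Gender", "DOB")]

for (i in seq_along(all.sheets)) {
  sheet <- all.sheets[[i]]
  # paste0() builds "Absences.1", "Absences.2", ...; [[ ]] accepts that string
  Cleaned[[paste0("Absences.", i)]] <- sheet$Absences + sheet$Absences_excused
  Cleaned[[paste0("Lates.", i)]] <- sheet$Lates
}

names(Cleaned)
```

This matches rows purely by position, so it is only safe when the sheets really are aligned. The merge above also copes with a sheet that contains an extra ID, such as ID 3 in the week 2 sample.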

A second approach to the merge (after renaming the data frame columns) uses Reduce:
Taken from (LINK)

merge.all <- function(frames, by) {
    return (Reduce(function(x, y) {merge(x, y, by = by, all = TRUE)}, frames))
}

merge.all(frames=WEEKlist, by=c('ID', 'Gender', 'DOB'))

I'm not sure which one is faster though.

EDIT: Reduce is faster, running 1000 iterations on a Windows 7 machine:

    test replications elapsed relative user.self sys.self
1   LOOP         1000   10.12  1.62701      7.89        0
2 REDUCE         1000    6.22  1.00000      5.34        0
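
The table layout looks like what the rbenchmark package prints, though the benchmarking code itself is not shown in the answer. Here is a rough sketch of how such a comparison might be set up with base R's system.time, on small made-up frames. The helpers `make_week`, `merge_loop` and `merge_reduce` are hypothetical names, and no timings are shown because they vary by machine:

```r
# Build a made-up, already-renamed weekly frame (hypothetical values)
make_week <- function(week) {
  df <- data.frame(ID = 1:2, Gender = c("M", "F"), DOB = c(1997, 1998),
                   Absences = c(5, 4), Lates = c(14, 3))
  names(df)[4:5] <- paste(names(df)[4:5], week, sep = ".")
  df
}
frames <- lapply(1:3, make_week)
keys <- c("ID", "Gender", "DOB")

# Loop version: start from the key columns and merge each frame in turn
merge_loop <- function(frames, by) {
  out <- frames[[1]][, by]
  for (df in frames) out <- merge(out, df, by = by, all = TRUE)
  out
}

# Reduce version: fold merge() over the list
merge_reduce <- function(frames, by) {
  Reduce(function(x, y) merge(x, y, by = by, all = TRUE), frames)
}

# Time 100 repetitions of each approach
loop_time <- system.time(for (k in 1:100) merge_loop(frames, keys))
reduce_time <- system.time(for (k in 1:100) merge_reduce(frames, keys))

# Elapsed seconds for each approach
c(LOOP = loop_time[["elapsed"]], REDUCE = reduce_time[["elapsed"]])
```

With frames this small the difference may be negligible; larger or more numerous sheets would give a more meaningful comparison.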
