Making multiple named data frames with a loop
Problem description

I'm in the process of learning. I didn't ask my first question well, so I'm trying again and doing my best to be clearer. I'm trying to create a series of data frames as a reproducible example for my larger issue. I would like to make 4 data frames, each named differently by year. Eventually I will merge these four data frames to explain where I am encountering my issue.

Here is my most recent attempt. It runs, but it creates a list of four data frames instead of any data frames in the global environment.

datafrom <- list()
years <- c(2006,2008,2010,2012)
for (i in 1:length(years)) {
    UniqueID <- 1:10 # <- Not all numeric - Kept as character vector
    Name <- LETTERS[seq( from = 1, to = 10 )]
    Entity_Type <- factor("This","That")
    Data1 <- rnorm(10)
    Data2 <- rnorm(10)
    Data3 <- rnorm(10)
    Data4 <- rnorm(10)
    Year <- years[i]
    datafrom[[i]] <- data.frame(UniqueID, Name, Entity_Type, Data1, Data2, Data3, Data4, Year)
}

I would like 4 separate data frames, each named datafrom2006, datafrom2008, etc.

Many thanks in advance for your patience with my learning.

Answer

I'll demonstrate a few (of many) techniques here, and I'll call them (1) brute force, (2) list-based, and (3) single long-form data.frame.

I'll add to the example the use of a function that you want to apply to each data.frame. Though contrived, it helps make the point:

## some constants used throughout
years <- c(2006, 2008, 2010, 2012)
n <- 10
myfunc <- function(x) {
    interestingPart <- x[ , grepl('^Data', colnames(x)) ]
    sapply(interestingPart, mean)
}

Brute Force

Yes, you can create multiple like-named and same-structure data.frames from a loop, though it is typically frowned upon by many experienced (R?) programmers:

set.seed(42)
for (yr in years) {
    tmpdf <- data.frame(UniqueID=as.character(1:n),
                        Name=LETTERS[1:n],
                        Entity_Type=factor(c('this', 'that')),
                        Data1=rnorm(n),
                        Data2=rnorm(n),
                        Data3=rnorm(n),
                        Data4=rnorm(n),
                        Year=yr)
    assign(sprintf('datafrom%s', yr), tmpdf)
}
rm(yr, tmpdf)
ls()
## [1] "datafrom2006" "datafrom2008" "datafrom2010" "datafrom2012" "myfunc"
## [6] "n" "years"
head(datafrom2006, n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.4554501 2006
## 2 2 B that -0.5646982 2.2866454 -1.7813084 0.7048373 2006

In order to see the results for each data.frame, one would typically (though not always) do something like this:
myfunc(datafrom2006)
## Data1 Data2 Data3 Data4
## 0.5472968 -0.1634567 -0.1780795 -0.3639041
myfunc(datafrom2008)
## Data1 Data2 Data3 Data4
## -0.02021535 0.01839391 0.53907680 -0.21787537
myfunc(datafrom2010)
## Data1 Data2 Data3 Data4
## 0.25110630 -0.08719458 0.22924781 -0.19857243
myfunc(datafrom2012)
## Data1 Data2 Data3 Data4
## -0.7949660 0.2102418 -0.2022066 -0.2458678
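If the separately named data frames already exist, they do not have to stay separate. As a minimal sketch (using a cut-down, hypothetical version of the data above with only Data1 and Year), base R's mget can collect like-named objects back into a named list, after which they can be processed together:

```r
# Assumption: a reduced stand-in for the data frames built above.
years <- c(2006, 2008, 2010, 2012)
n <- 10
set.seed(42)
for (yr in years) {
  assign(sprintf("datafrom%s", yr), data.frame(Data1 = rnorm(n), Year = yr))
}

# Build the object names, then fetch all of them at once into a named list.
dfnames <- sprintf("datafrom%s", years)
collected <- mget(dfnames)

# One call now covers every data frame, instead of one call per name.
means <- sapply(collected, function(d) mean(d$Data1))
```

This keeps the brute-force objects usable, but it still relies on constructing names as strings, which is part of why the list-based approach is usually preferred.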
List-Based
set.seed(42)
datafrom <- sapply(as.character(years), function(yr) {
    data.frame(UniqueID=as.character(1:n),
               Name=LETTERS[1:n],
               Entity_Type=factor(c('this', 'that')),
               Data1=rnorm(n),
               Data2=rnorm(n),
               Data3=rnorm(n),
               Data4=rnorm(n),
               Year=yr)
}, simplify=FALSE)
str(datafrom)
## List of 4
## $ 2006:'data.frame': 10 obs. of 8 variables:
## ..$ UniqueID : Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
## ..$ Name : Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
## ..$ Entity_Type: Factor w/ 2 levels "that","this": 2 1 2 1 2 1 2 1 2 1
## ..$ Data1 : num [1:10] 1.371 -0.565 0.363 0.633 0.404 ...
## ..$ Data2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
## ..$ Data3 : num [1:10] -0.307 -1.781 -0.172 1.215 1.895 ...
## ..$ Data4 : num [1:10] 0.455 0.705 1.035 -0.609 0.505 ...
## ..$ Year : Factor w/ 1 level "2006": 1 1 1 1 1 1 1 1 1 1
## $ 2008:'data.frame': 10 obs. of 8 variables:
## ..$ UniqueID : Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
#### ...snip...
head(datafrom[[1]], n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.4554501 2006
## 2 2 B that -0.5646982 2.2866454 -1.7813084 0.7048373 2006
head(datafrom[['2008']], n=2)
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 0.2059986 0.32192527 -0.3672346 -1.04311894 2008
## 2 2 B that -0.3610573 -0.78383894 0.1852306 -0.09018639 2008
However, with this you can test how your function behaves on just one of them:
myfunc(datafrom[[1]])
myfunc(datafrom[['2010']])
and then run the function on all of them very simply:
lapply(datafrom, myfunc)
## $`2006`
## Data1 Data2 Data3 Data4
## 0.5472968 -0.1634567 -0.1780795 -0.3639041
## $`2008`
## Data1 Data2 Data3 Data4
## -0.02021535 0.01839391 0.53907680 -0.21787537
## $`2010`
## Data1 Data2 Data3 Data4
## 0.25110630 -0.08719458 0.22924781 -0.19857243
## $`2012`
## Data1 Data2 Data3 Data4
## -0.7949660 0.2102418 -0.2022066 -0.2458678
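If separately named objects such as datafrom2006 are still wanted after working with the list, they can be produced from it in one step. The following is a sketch under assumptions: it uses a reduced, hypothetical version of the list above, and it writes into a fresh environment called target (a name chosen here for illustration) rather than the global environment, to keep the example free of side effects. Passing envir = globalenv() instead would create the objects in the workspace.

```r
# Assumption: a reduced stand-in for the list built above.
set.seed(42)
years <- c(2006, 2008, 2010, 2012)
n <- 10
datafrom <- sapply(as.character(years), function(yr) {
  data.frame(Data1 = rnorm(n), Year = yr)
}, simplify = FALSE)

# Rename the list elements from "2006" to "datafrom2006", and so on.
named <- setNames(datafrom, paste0("datafrom", names(datafrom)))

# Copy each list element into the environment as its own object.
target <- new.env()
list2env(named, envir = target)
ls(target)
```

The list remains the single source of truth, and the named copies are created only when something downstream really needs them.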
Long-form Data
If instead you keep all of the data in the same data.frame, using your already-defined column of Year, you can still segment it for exploring individual years:
longdf <- do.call('rbind.data.frame', datafrom)
rownames(longdf) <- NULL
longdf[c(1,11,21,31),]
## UniqueID Name Entity_Type Data1 Data2 Data3 Data4 Year
## 1 1 A this 1.3709584 1.3048697 -0.3066386 0.45545012 2006
## 11 1 A this 0.2059986 0.3219253 -0.3672346 -1.04311894 2008
## 21 1 A this 1.5127070 1.3921164 1.2009654 -0.02509255 2010
## 31 1 A this -1.4936251 0.5676206 -0.0861073 -0.04069848 2012
- Simple subsets: subset(longdf, Year == 2006), though subset has its strengths and its drawbacks.
- by(longdf, longdf$Year, myfunc)
- If using library(dplyr), try longdf %>% filter(Year == 2010) %>% myfunc()
(Side note: when trying to plot aggregate data, it's often easier when the data is in this form, especially when using ggplot2-like layering and aesthetics.)
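Per-year summaries can also be computed directly on the long-form data with base R alone. This is a minimal sketch, assuming a reduced version of longdf with only two data columns; yearly_means and by_year are illustrative names, not part of the original answer:

```r
# Assumption: a reduced stand-in for longdf as built above.
set.seed(42)
years <- c(2006, 2008, 2010, 2012)
n <- 10
longdf <- do.call(rbind, lapply(years, function(yr) {
  data.frame(UniqueID = as.character(1:n),
             Data1 = rnorm(n),
             Data2 = rnorm(n),
             Year = yr)
}))
rownames(longdf) <- NULL

# One row per year, holding the mean of each data column.
yearly_means <- aggregate(cbind(Data1, Data2) ~ Year, data = longdf, FUN = mean)

# split() goes the other way: from long form back to a list keyed by year.
by_year <- split(longdf, longdf$Year)
```

Because split returns a named list, moving between the long-form and list-based representations is cheap, so either can be used wherever it is more convenient.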
Rationale Against "Brute Force"
In answer to your comment question: when making different variables with the same structure, it is easy to deduce that you will be doing the same thing to each of them, either in turn or in immediate succession. As a general programming principle, many try to generalize what they do so that if it can be done once, it can be done an arbitrary number of times without (heavily) adjusting the code. For instance, compare what was necessary in applying myfunc in the two examples above.
Further, if you later want to aggregate the results from your calls to myfunc, it is more laborious in the "brute force" example (as you must capture each return and combine manually), whereas the other two techniques can use simpler summarizing functions (e.g., another lapply, or perhaps Reduce or Filter).
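To make the aggregation point concrete, here is a sketch of combining the per-year results from the list-based approach. It assumes a reduced version of the data with two data columns; summary_mat and overall are hypothetical names used only for this illustration:

```r
# Assumption: a reduced stand-in for the list and function defined above.
set.seed(42)
years <- c(2006, 2008, 2010, 2012)
n <- 10
myfunc <- function(x) {
  interestingPart <- x[, grepl("^Data", colnames(x)), drop = FALSE]
  sapply(interestingPart, mean)
}
datafrom <- sapply(as.character(years), function(yr) {
  data.frame(Data1 = rnorm(n), Data2 = rnorm(n), Year = yr)
}, simplify = FALSE)

# Apply the function to every data frame in one call.
results <- lapply(datafrom, myfunc)

# Stack the per-year vectors into a matrix: one row per year.
summary_mat <- do.call(rbind, results)

# Reduce folds the list pairwise; here it averages the per-year means.
overall <- Reduce(`+`, results) / length(results)
```

With the brute-force objects, each of these steps would need the four results to be captured and combined by hand, and the code would have to change whenever a year is added.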