Less clunky reshaping of anscombe data
Question
I was trying to use ggplot2 to plot the built-in anscombe data set in R (which contains four different small data sets with identical correlations but radically different relationships between X and Y). My attempts to reshape the data properly were all pretty ugly. I used a combination of reshape2 and base R; a Hadleyverse 2 (tidyr/dplyr) or a data.table solution would be fine with me, but the ideal solution would be:
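- short, with no repeated code
- comprehensible (somewhat in conflict with criterion #1)
- involve as little hard-coding of column numbers, etc. as possible
Original format: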
anscombe
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## ...
## 11 5 5 5 8 5.68 4.74 5.73 6.89
Desired format:
## s x y
## 1 1 10 8.04
## 2 1 8 6.95
## ...
## 44 4 8 6.89
Here is my attempt:
library("reshape2")
ff <- function(x,v)
    setNames(transform(
        melt(as.matrix(x)),
        v1=substr(Var2,1,1),
        v2=substr(Var2,2,2))[,c(3,5)],
        c(v,"s"))
f1 <- ff(anscombe[,1:4],"x")
f2 <- ff(anscombe[,5:8],"y")
f12 <- cbind(f1,f2)[,c("s","x","y")]
Now the plot:
library("ggplot2"); theme_set(theme_classic())
th_clean <-
    ## panel.spacing replaces panel.margin, which is deprecated since ggplot2 2.2.0
    theme(panel.spacing=grid::unit(0,"lines"),
          axis.ticks.x=element_blank(),
          axis.text.x=element_blank(),
          axis.ticks.y=element_blank(),
          axis.text.y=element_blank()
          )
ggplot(f12,aes(x,y))+geom_point()+
    facet_wrap(~s)+labs(x="",y="")+
    th_clean
Recommended answer
If you are really dealing with the "anscombe" dataset, then I would say @Thela's reshape solution is very direct.
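@Thela's answer is not reproduced here, so the following is only a sketch of what a base R reshape call for this data can look like, not necessarily the exact code from that answer. It relies on the documented behaviour that sep = "" makes reshape() split each column name between its letter and its digit; the name long is just an illustrative choice.

```r
# Base R only: stack x1..x4 and y1..y4 into long format.
# With sep = "", reshape() splits each name between its letter and digit,
# so "x1" becomes variable "x" at time 1.
long <- reshape(
  anscombe,
  direction = "long",
  varying   = names(anscombe),
  sep       = "",
  timevar   = "s"
)

# reshape() also adds an "id" column; keep only the three requested columns.
long <- long[, c("s", "x", "y")]
rownames(long) <- NULL
head(long, 3)
```

No column numbers are hard-coded, which fits the third criterion in the question.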
However, here are a few other options to consider:
You can write your own "reshape" function, perhaps something like this:
myReshape <- function(indf = anscombe, stubs = c("x", "y")) {
  temp <- sapply(stubs, function(x) {
    unlist(indf[grep(x, names(indf))], use.names = FALSE)
  })
  s <- rep(seq_along(grep(stubs[1], names(indf))), each = nrow(indf))
  data.frame(s, temp)
}
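Notes:
- I'm not sure that this is necessarily any less clunky than what you're already doing.
- This approach will not work if the data are "unbalanced" (for example, more "x" columns than "y" columns).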
Option 2: "dplyr" + "tidyr"
Since pipes are the rage these days, you can also try:
library(dplyr)
library(tidyr)
anscombe %>%
  gather(var, val, everything()) %>%
  extract(var, into = c("variable", "s"), "(.)(.)") %>%
  group_by(variable, s) %>%
  mutate(ind = sequence(n())) %>%
  spread(variable, val)
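Notes:
- I'm not sure that this is necessarily any less clunky than what you're already doing, but some people like the pipe approach.
- This approach should be able to handle unbalanced data.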
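In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). As a sketch, assuming tidyr 1.0.0 or later is installed, the same reshaping can be done in one step with the special ".value" name, which tells pivot_longer() to use the first regex group ("x" or "y") as the output column name. The object name long is illustrative.

```r
library(tidyr)
library(dplyr)

# Split each column name into a value column ("x"/"y") and a set number "s".
long <- anscombe %>%
  pivot_longer(
    everything(),
    names_to      = c(".value", "s"),
    names_pattern = "(.)(.)"
  ) %>%
  arrange(s)   # group the rows by data set, as in the desired output

head(long, 3)
```

Here s comes back as a character column; wrap it in as.integer() if a numeric set index is preferred.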
Option 3: "splitstackshape"
Before @Arun went and did all that fantastic work on melt.data.table, I had written merged.stack in my "splitstackshape" package. With that, the approach would be:
library(splitstackshape)
setnames(
  merged.stack(
    data.table(anscombe, keep.rownames = TRUE),
    var.stubs = c("x", "y"), sep = "var.stubs"),
  ".time_1", "s")[]
Some notes:
- merged.stack needs something to treat as an "id", hence the need for data.table(anscombe, keep.rownames = TRUE), which adds a column named "rn" with the row numbers.
- The sep = "var.stubs" basically means that we don't really have a separator variable, so we'll just strip out the stub and use whatever remains for the "time" variable.
- merged.stack will work if the data are unbalanced. For instance, try using it with anscombe2 <- anscombe[1:7] as your dataset instead of "anscombe".
- The same package also has a function called Reshape that builds upon reshape to let it reshape unbalanced data. But it's slower and less flexible than merged.stack. The basic approach would be Reshape(data.table(anscombe, keep.rownames = TRUE), var.stubs = c("x", "y"), sep = "") and then rename the "time" variable using setnames.
Option 4: melt.data.table
This was mentioned in the comments above, but hasn't been shared as an answer. Outside of base R's reshape, this is a very direct approach that handles column renaming from within the function itself:
library(data.table)
melt(as.data.table(anscombe),
     measure.vars = patterns(c("x", "y")),
     value.name = c('x', 'y'),
     variable.name = "s")
Notes:
- Will soon be insanely fast.
- Better supported than "splitstackshape" or reshape ;-)
- Handles unbalanced data just fine.