Less clunky reshaping of anscombe data


Question

I was trying to use ggplot2 to plot the built-in anscombe data set in R (which contains four small data sets with identical correlations but radically different relationships between X and Y). My attempts to reshape the data properly were all pretty ugly. I used a combination of reshape2 and base R; a Hadleyverse 2 (tidyr/dplyr) or a data.table solution would be fine with me, but the ideal solution would be:

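  • short, with no repeated code
  • easily comprehensible (somewhat in conflict with criterion #1)
  • involve as little hard-coding of column numbers, etc. as possible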

The original format:

 anscombe
 ##     x1 x2 x3 x4    y1   y2   y3     y4
 ##  1  10 10 10  8  8.04 9.14  7.46  6.58
 ##  2   8  8  8  8  6.95 8.14  6.77  5.76
 ##  3  13 13 13  8  7.58 8.74 12.74  7.71
 ## ...
 ## 11   5  5  5  8  5.68 4.74  5.73  6.89

The desired format:

 ##    s  x    y
 ## 1  1 10 8.04
 ## 2  1  8 6.95
 ## ...
 ## 44 4  8 6.89

Here's my attempt:

 library("reshape2")
 ff <- function(x,v) 
     setNames(transform(
        melt(as.matrix(x)),
             v1=substr(Var2,1,1),
             v2=substr(Var2,2,2))[,c(3,5)],
          c(v,"s"))
 f1 <- ff(anscombe[,1:4],"x")
 f2 <- ff(anscombe[,5:8],"y")
 f12 <- cbind(f1,f2)[,c("s","x","y")]

Now the plot:

 library("ggplot2"); theme_set(theme_classic())
 # panel.margin was renamed to panel.spacing in ggplot2 2.2.0
 th_clean <- 
   theme(panel.spacing=grid::unit(0,"lines"),
     axis.ticks.x=element_blank(),
     axis.text.x=element_blank(),
     axis.ticks.y=element_blank(),
     axis.text.y=element_blank()
     )
 ggplot(f12,aes(x,y))+geom_point()+
   facet_wrap(~s)+labs(x="",y="")+
   th_clean

Recommended answer

If you are really dealing with the "anscombe" dataset, then I would say @Thela's reshape solution is very direct.
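
For readers who have not seen it, a base R reshape() call along these lines might look like the sketch below. It is not copied from @Thela's answer; it assumes the column names follow the letter-plus-digit pattern (x1 ... y4), which is what lets sep = "" split each name into a stub and a set number. The object name long and the id column name are illustrative.

```r
# Base R only: 'anscombe' ships with the datasets package.
long <- reshape(
  anscombe,
  direction = "long",
  varying   = names(anscombe),  # every column is a measurement column
  sep       = "",               # split "x1" into stub "x" and set number 1
  timevar   = "s",              # name for the set index (1 to 4)
  idvar     = "id"              # row-within-set id that reshape() creates
)

# Keep only the columns the question asks for and drop the "1.1"-style row names.
long <- long[, c("s", "x", "y")]
rownames(long) <- NULL
head(long, 3)
```

Selecting the columns by name avoids relying on the order in which reshape() happens to return them.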

However, here are a few other options to consider:

Option 1: your own function

You can write your own "reshape" function, perhaps something like this:

myReshape <- function(indf = anscombe, stubs = c("x", "y")) {
  temp <- sapply(stubs, function(x) {
    unlist(indf[grep(x, names(indf))], use.names = FALSE)
  })
  s <- rep(seq_along(grep(stubs[1], names(indf))), each = nrow(indf))
  data.frame(s, temp)
}

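Notes:

  1. I'm not sure that this is necessarily any less clunky than what you're already doing.
  2. This approach won't work if the data are "unbalanced" (for example, more "x" columns than "y" columns).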
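
To make the "unbalanced" caveat concrete, the sketch below counts how many columns belong to each stub. The helpers stub_counts and is_balanced are hypothetical names introduced here for illustration; they are not part of the answer or of any package. When the counts differ, the per-stub vectors built by a function like myReshape have different lengths and can no longer be bound into one rectangular data frame.

```r
# Hypothetical helper: count how many columns start with each stub.
stub_counts <- function(df, stubs = c("x", "y")) {
  vapply(stubs, function(s) sum(startsWith(names(df), s)), integer(1))
}

# Hypothetical helper: balanced means every stub has the same number of columns.
is_balanced <- function(df, stubs = c("x", "y")) {
  length(unique(stub_counts(df, stubs))) == 1
}

stub_counts(anscombe)        # x and y both have 4 columns
is_balanced(anscombe)        # TRUE
is_balanced(anscombe[1:7])   # FALSE: y4 has been dropped
```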

Option 2: "dplyr" + "tidyr"

Since pipes are the rage these days, you can also try:

library(dplyr)
library(tidyr)

anscombe %>%
  gather(var, val, everything()) %>%
  extract(var, into = c("variable", "s"), "(.)(.)") %>% 
  group_by(variable, s) %>%
  mutate(ind = sequence(n())) %>%
  spread(variable, val)

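Notes:

  1. I'm not sure that this is necessarily any less clunky than what you're already doing, but some people like the pipe approach.
  2. This approach should be able to handle unbalanced data.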
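
As an aside that is not part of the original answer: tidyr 1.0.0 and later provide pivot_longer(), which can do the same reshape in one call by using the special ".value" name. The sketch below assumes such a tidyr version is installed; the object name long is illustrative.

```r
library(tidyr)

long <- pivot_longer(
  anscombe,
  cols          = everything(),
  names_to      = c(".value", "s"),  # ".value" turns the stub (x or y) into a column
  names_pattern = "(.)(.)"           # first character = stub, second = set number
)

# pivot_longer() walks the input row by row, so sort by set to group each data set.
long <- long[order(long$s), ]
head(long, 3)
```

Note that s comes back as a character column here; wrap it in as.integer() if a numeric set index is needed.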

Option 3: "splitstackshape"

Before @Arun went and did all that fantastic work on melt.data.table, I had written merged.stack in my "splitstackshape" package. With that, the approach would be:

library(splitstackshape)
setnames(
  merged.stack(
    data.table(anscombe, keep.rownames = TRUE), 
               var.stubs = c("x", "y"), sep = "var.stubs"), 
  ".time_1", "s")[]

Some notes:

  1. merged.stack needs something to treat as an "id", hence the need for data.table(anscombe, keep.rownames = TRUE), which adds a column named "rn" with the row numbers
  2. The sep = "var.stubs" basically means that we don't really have a separator variable, so we'll just strip out the stub and use whatever remains for the "time" variable
  3. merged.stack will work if the data are unbalanced. For instance, try using it with anscombe2 <- anscombe[1:7] as your dataset instead of "anscombe".
  4. The same package also has a function called Reshape that builds upon reshape to let it reshape unbalanced data. But it's slower and less flexible than merged.stack. The basic approach would be Reshape(data.table(anscombe, keep.rownames = TRUE), var.stubs = c("x", "y"), sep = "") and then rename the "time" variable using setnames.

Option 4: melt.data.table

This was mentioned in the comments above, but hasn't been shared as an answer. Outside of base R's reshape, this is a very direct approach that handles column renaming from within the function itself:

library(data.table)
melt(as.data.table(anscombe), 
     measure.vars = patterns(c("x", "y")), 
     value.name=c('x', 'y'), 
     variable.name = "s")

Notes:

  1. Will be insanely fast.
  2. ... than "splitstackshape" or reshape ;-)
  3. Handles unbalanced data just fine.
