复制数据框的观察值,但也替换R中的特定变量值 [英] Duplicating observations of a dataframe, but also replacing specific variable values in R

查看:54
本文介绍了复制数据框的观察值,但也替换R中的特定变量值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找有关一些数据重组的建议.我正在使用Google Forms收集一些数据,并将其下载为csv文件,其外观类似于以下内容:

I am looking for some advice on some data restructuring. I am collecting some data using Google Forms which I download as a csv file and looks something like the following:

# alpha                 beta    option
#  6             8, 9, 10, 11    apple
#  9                        6     pear
#  1                        6    apple
#  3                     8, 9     pear
#  3                     6, 8     lime
#  3                        1    apple
#  2, 4, 7, 11              9     lime

数据具有两个变量(alpha和beta),每个变量都编号.对于我的大部分数据,每个变量中只有一个数字.但是,对于某些观察,可以有两个,三个甚至最多十个数字.这是因为这些是使用Google表单中的复选框"选项收集的答复,该答复允许对一个调查问题的多个答案.另外,对于某些潜在的解决方案来说,Google表单在多个答案中的每一个答案之前返回前导空格也可能很重要.

The data has two variables (alpha and beta) that each list numbers. For the majority of my data there is only one number in each variable. However, for some observations there can be two, three or even up to ten numbers. This is because these are responses gathered using the 'checkbox' option in google forms which allows multiple answers to one survey question. Also, it may be of importance to some potential solutions that google forms returns leading whitespace before each of the multiple answers.

在我的真实数据中,这仅发生在所有观察结果中的很小一部分,以上是一个更简洁的示例.数据集中还有其他几个变量.在这里,我只包括一个包含因素的称为选项"的东西.

In my real data this occurs in only a very small proportion of all observations, the above is a more condensed example. There are several other variables in the dataset. Here I am just including one called 'option' that contains factors.

我需要做的是复制所有在'alpha'或'beta'变量中包含多个数字的观察值.复制的行数应等于alpha或beta变量中存在的数字数.然后,我需要分别用每个数字替换"alpha"或"beta"变量中的数字序列.这将导致如下所示:

What I need to do is to duplicate all observations that contain multiple numbers in either the 'alpha' or the 'beta' variable. The number of duplicated rows should be equal to the number of numbers that exist in the alpha or beta variable. Then, I need to replace the sequence of numbers in the 'alpha' or 'beta' variables with each number independently. That would result in something like the following:

#  alpha  beta   option
#     6    8     apple
#     6    9     apple
#     6   10     apple
#     6   11     apple
#     9    6      pear
#     1    6     apple
#     3    8      pear
#     3    9      pear
#     3    6      lime
#     3    8      lime
#     3    1     apple
#     2    9      lime
#     4    9      lime
#     7    9      lime
#    11    9      lime

以下是复制上面原始示例数据的数据.我将数据框称为演示":

Here is the data that reproduces the original example data above. I have called the dataframe 'demo':

demo<-structure(list(alpha = structure(c(4L, 5L, 1L, 3L, 3L, 3L, 2L), .Label =
 c("1","2, 4, 7, 11", "3", "6", "9"), class = "factor"), beta = structure(c(5L, 2L, 2L, 
4L, 3L, 1L, 6L), .Label = c("1", "6", "6, 8", "8, 9", "8, 9, 10, 11", "9"), class =   
"factor"), option = structure(c(1L, 3L, 1L, 3L, 2L, 1L, 2L), .Label = c("apple", 
"lime", "pear"), class = "factor")), .Names = c("alpha", "beta", "option"), class =   
"data.frame", row.names = c(NA, -7L))

好.因此,我想我已经写了一些代码,这些代码以漫长的方式确实导致了我正在寻找的新数据帧.但是,感觉好像必须有一种更优雅,更好的方法来做到这一点.

OK. So I think I have written some code that in a very long-winded fashion does lead to the new dataframe I am looking for. However, it feels like there must be a more elegant and better way of doing it.

基本上,我首先处理'alpha'变量.我首先根据变量中是否存在逗号来对观察结果进行子集化.对于包含逗号的观察值,然后使用strsplit分隔数字.然后,我计算每个观察有多少个数字,并以此重复每个观察.然后,我将拆分的数字融合到一个数据帧中,所有数字都存储在一个名为值"的变量中.然后,我简单地用融化的值"变量中的数据替换"alpha"变量.然后,我用不包含逗号的数据重新整理了一下.然后,我使用此df并处理'beta'变量....

Basically, I work on the 'alpha' variable first. I first subset the observations based on whether commas exist in the variable or not. With the observations that contain commas, I then use strsplit to separate the numbers. I then count how many numbers exist for each observation and duplicate each observation by that. I then melt the split numbers into a dataframe with all the numbers in a variable named 'value'. I simply then replace the 'alpha' variable with the data in the melted 'value' variable. I then rbind this back with the data that did not contain commas. I then use this df and work on the 'beta' variable....

这是我的解决方案(似乎可行吗?):

Here is my solution (it seems to work?):

library(reshape2)

demo$a<-grepl(",", demo$alpha)
demo.atrue <- demo[ which(demo$a=='TRUE'), ]
demo.afalse <- demo[ which(demo$a=='FALSE'), ]
demo.atrue$alpha<-as.character(demo.atrue$alpha)
temp<-strsplit(demo.atrue$alpha, ",")
temp.lengths<-lapply(temp, length)

for (i in 1:length(temp)) { 
df.expanded <- demo.atrue[rep(row.names(demo.atrue), temp.lengths), 1:3]
}

temp.melt<-melt(temp)
df.expanded$alpha<-temp.melt$value
demo.afalse<-demo.afalse[c(1:3)]
demonew<-rbind(demo.afalse, df.expanded)



demonew$b<-grepl(",", demonew$beta)
demonew.btrue <- demonew[ which(demonew$b=='TRUE'), ]
demonew.bfalse <- demonew[ which(demonew$b=='FALSE'), ]
demonew.btrue$beta<-as.character(demonew.btrue$beta)

temp<-strsplit(demonew.btrue$beta, ",")
temp.lengths<-lapply(temp, length)

for (i in 1:length(temp)) { 
  df.expanded1 <- demonew.btrue[rep(row.names(demonew.btrue), temp.lengths), 1:3]
}

temp.melt<-melt(temp)
df.expanded1$beta<-temp.melt$value
demonew.bfalse<-demonew.bfalse[c(1:3)]
demonew1<-rbind(df.expanded1, demonew.bfalse)

demonew1  #this seems to work, but doesn't feel very efficient

虽然效率可能不是很高,但我不确定这是否在所有条件下都有效.尤其是对于同一观察,如果'alpha'和'beta'变量中都存在多个数字.我已经通过一些示例对其进行了测试,似乎还可以,但是我对此并不自信.

As well as perhaps not being very efficient, I am not sure whether this will work in all conditions. In particular if multiple numbers exist in both the 'alpha' and 'beta' variables for the same observation. I have tested it with a few examples and it seems ok, but I am not confident with it.

谢谢您的考虑.

推荐答案

您可以使用我的 cSplit >函数,嵌套两次,就像这样:

You can use my cSplit function, nested twice, like this:

cSplit(cSplit(demo, "alpha", ",", "long"), "beta", ",", "long")
#     alpha beta option
#  1:     6    8  apple
#  2:     6    9  apple
#  3:     6   10  apple
#  4:     6   11  apple
#  5:     9    6   pear
#  6:     1    6  apple
#  7:     3    8   pear
#  8:     3    9   pear
#  9:     3    6   lime
# 10:     3    8   lime
# 11:     3    1  apple
# 12:     2    9   lime
# 13:     4    9   lime
# 14:     7    9   lime
# 15:    11    9   lime


一些基准:


Some benchmarks:

更多有趣的样本数据. 700行而不是7行(仍然是一个很小的数据集)...

More interesting sample data. 700 rows instead of 7 (still quite a small dataset)...

demo <- do.call(rbind, replicate(100, demo, FALSE))
library(data.table)
demo2 <- data.table(demo)

测试功能...

## MrFlick's
fun1 <- function() {
  do.call(rbind, with(demo, Map(expand.grid,
                                alpha = strsplit(alpha,", "),
                                beta = strsplit(beta, ", "),
                                option = option
  )))
} 

## Mine
fun2 <-  function() {
  cSplit(cSplit(demo2, "alpha", ",", "long"), "beta", ",", "long")
} 

## thelatemail's one-liner
fun3 <- function() {
  do.call(rbind,do.call(Map, c(expand.grid, lapply(demo, strsplit, ", "))))
} 

实际基准测试...

library(microbenchmark)
microbenchmark(MF = fun1(), AM = fun2(), TH = fun3(), times = 10)
# Unit: milliseconds
#  expr       min        lq    median        uq       max neval
#    MF 785.34875 789.94924 800.11046 800.93643 813.62390    10
#    AM  11.54569  11.93483  12.14181  12.31329  12.93208    10
#    TH 790.46069 799.68518 803.47294 827.69520 899.11219    10

这篇关于复制数据框的观察值,但也替换R中的特定变量值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆