Reordering (deleting/changing order) columns of data in data frame

Problem description

I have two large data sets and I am attempting to reformat the older data set to put the questions in the same order as the newer data set (so that I can easily perform t-tests on each identical question to track significant changes over the 2 years between data sets). The new version both deleted and added questions when changing from the old version.

The way I've been attempting to do this, R keeps crashing due to, as best I can figure, vectors being too large. I'm not sure how they are getting to be this large, however! Below is what I am doing:

Both data sets have the same format: 418 columns in the old set and 415 in the new. I want to match approximately the first 158 columns of the new data set to the old. The columns are named q1, q2, and so on (q1-q415 in the new set), and each column holds numeric answers from 1 to 5, or NA. There are approximately 100 answers per question/column; the old data set has more respondents (140 rows in old vs. 114 rows in new). An example is below (but keep in mind that the full sets have over 400 columns and over 100 rows!)

The following is an example of what data.old looks like. data.new has the same layout, only with a different number of rows of number/NA answers. Here I show questions 1 through 20 and the first 10 rows.

data.old = 418 columns (q1 through q418) x 140 rows
data.new = 415 columns (q1 through q415) x 114 rows

I need to match the first 170 columns of data.old to the first 157 columns of data.new. To do this, I will be deleting 17 columns from data.old (questions that were in the data.old questionnaire and deleted from the data.new questionnaire) but also adding 7 new columns to data.old (which will contain NAs: placeholders for where data.new introduced new questions that did not exist in the data.old questionnaire).

    > data.old
    q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 q15 q16 q17 q18 q19 q20
    1  3  4  3  3  5  4  1  NA 4   NA  1   2   NA  5   4   3   2   3   1
    3  4  5  2  2  4  NA 1  3  2   5   2   NA  3   2   1   4   3   2   NA
    2  NA 2  3  2  1  4  3  5  1   2   3   4   3   NA  NA  2   1   2   5
    1  2  4  1  2  5  2  3  2  1   3   NA  NA  2   1   5   5   NA  2   3
    4  3  NA 2  1  NA 3  4  2  2   1   4   5   5   NA  3   2   3   4   1
    5  2  1  5  3  2  3  3  NA 2   1   5   4   3   4   5   3   NA  2   NA
    NA 2  4  1  5  5  NA NA 2  NA  1   3   3   3   4   4   5   5   3   1
    4  5  4  5  5  4  3  4  3  2   5   NA  2   NA  2   3   5   4   5   4
    2  2  3  4  1  5  5  3  NA 2   1   3   5   4   NA  2   3   4   3   2
    2  1  5  3  NA 2  3  NA 4  5   5   3   2   NA  2   3   1   3   2   4

So in the new set, some of the questions were deleted, some new ones were added, and some changed order, so I went through and created subsets of old data in the order that I would need to combine them again to match the new dataset. When a question does not exist in the old data set, I want to use the question in the new data set so that I can (theoretically) perform my t-tests in a big loop.

    dataold.set1 <- dataold[1:16]
    dataold.set2 <- dataold[18:19]
    dataold.set3 <- dataold[21:23]
    dataold.set4 <- dataold[25:26]
    dataold.set5 <- dataold[30:33]
    dataold.set6 <- dataold[35:36]
    dataold.set7 <- dataold[38:39]
    dataold.set8 <- dataold[41:42]
    dataold.set9 <- dataold[44]
    dataold.set10 <- dataold[46:47]
    dataold.set11 <- dataold[49:54]
    dataold.set12 <- datanew[43:49]
    dataold.set13 <- dataold[62:85]
    dataold.set14 <- dataold[87:90]
    dataold.set15 <- datanew[78]
    dataold.set16 <- dataold[91:142]
    dataold.set17 <- dataold[149:161]
    dataold.set18 <- dataold[55:61]
    dataold.set19 <- dataold[163:170]

I then attempted to put the columns back together into one set. I tried

    dataold.adjust <- merge(dataold.set1, dataold.set2)
    dataold.adjust <- merge(dataold.adjust, dataold.set3)
    dataold.adjust <- merge(dataold.adjust, dataold.set4)

and I also tried

    dataold.adjust <- cbind(dataold.set1, dataold.set2, dataold.set3)

However, every time I try to perform these functions, R freezes, then crashes. I managed to get it to display an error once, and it said it could not work with a vector of 10 Mb, and then I got multiple errors involving over 1000 Mb vectors. I'm not really sure how my vectors are that large, when this is crashing out by set 3, which is only 23 columns of data in a table, and the data sets I'm normally using are over 400 columns in length.

Is there another way to do this that won't cause my program to crash and have memory issues (and won't require me to type out the names of over 100 columns), or is there some element of my code that I am missing that acts as a memory sink? I've been attempting to troubleshoot it and have spent an hour dealing with R crashing, without any luck figuring out how to make this work.

Thanks for the assistance!
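On the memory errors described above, one likely contributor is how `merge()` behaves when its two inputs share no column names: it has nothing to match on and returns every combination of rows (a Cartesian product), so each successive call multiplies the row count. The sketch below uses small made-up data frames (`a`, `b`, `c3` are illustrative names, not from the post) to show the effect.

```r
# Toy data frames with no column names in common (values are made up)
a  <- data.frame(q1 = 1:3, q2 = 4:6)
b  <- data.frame(q3 = 7:9)
c3 <- data.frame(q4 = 1:3)

ab  <- merge(a, b)     # no shared columns, so every row of a pairs with every row of b
abc <- merge(ab, c3)   # the row count multiplies again

nrow(ab)   # 3 * 3 rows
nrow(abc)  # 3 * 3 * 3 rows
```

With 140 rows per subset, the same pattern would give 140 x 140 = 19,600 rows after the first call and 140^3 = 2,744,000 after the second, which is at least consistent with the crash showing up around set 3. It does not by itself explain the `cbind()` failure.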

Solution

You're making tons of unnecessary copies of your data and then you're growing the final object (dataold.adjust). You just need a vector that orders the columns correctly:

    cols1 <- c(1:16,18:19,21:23,25:26,30:33,35:36,38:39,41:42,44,46:47,49:54)
    cols2 <- c(62:85,87:90)
    cols3 <- c(91:142,149:161,55:61,163:170)
    n1 <- length(cols1); n2 <- length(cols2); n3 <- length(cols3)
    # merge old / new data by row and add NA for unmatched rows;
    # the result holds Row.names first, then the old columns, then the new ones
    dataold.adjust <- merge(data.old[,c(cols1,cols2,cols3)],
      data.new[,c(43:49,78)], by="row.names", all=TRUE)
    # positions of the old and new columns in the merged result
    old.pos <- 1 + seq_len(n1 + n2 + n3)   # shifted by 1 to skip Row.names
    new.pos <- max(old.pos) + seq_len(8)   # 7 columns (43:49) plus column 78
    # put columns in desired order (this also drops Row.names)
    dataold.adjust <- dataold.adjust[,c(old.pos[1:n1],  # 1st cols from dataold
      new.pos[1:7],                                     # 1st cols from datanew
      old.pos[n1 + 1:n2],                               # 2nd cols from dataold
      new.pos[8],                                       # 2nd cols from datanew
      old.pos[n1 + n2 + 1:n3])]                         # 3rd cols from dataold

The last part is an absolute kludge, but I've hit my self-imposed time limit for SO today. :)
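A possible alternative that skips `merge()` altogether is sketched here under stated assumptions. The question asks for NA placeholder columns where the old questionnaire has no matching question, and the two data sets hold different respondents, so the placeholders can be generated directly instead of being borrowed from data.new. `reorder_with_placeholders()` is a hypothetical helper, not part of base R, and renaming the result to q1, q2, ... in the new order is an assumption about what is wanted. The toy data is made up.

```r
# Hypothetical helper (not part of base R): pick columns of `df` in the order
# given by `idx`; an NA entry in `idx` becomes an all-NA placeholder column.
reorder_with_placeholders <- function(df, idx) {
  cols <- lapply(idx, function(i) {
    if (is.na(i)) rep(NA_real_, nrow(df)) else df[[i]]
  })
  # Assumption: columns are renumbered to follow the new questionnaire's order
  names(cols) <- paste0("q", seq_along(idx))
  as.data.frame(cols)
}

# Made-up toy data: 6 respondents, 5 questions answered 1-5 or NA
set.seed(1)
toy_old <- as.data.frame(matrix(sample(c(1:5, NA), 6 * 5, replace = TRUE), nrow = 6))
names(toy_old) <- paste0("q", 1:5)

# Keep old q1 and q2, insert a placeholder, then old q4 and q3 (old q5 is dropped)
idx <- c(1, 2, NA, 4, 3)
toy_adjusted <- reorder_with_placeholders(toy_old, idx)

# For the data in the question, the index vector would presumably be:
# idx <- c(cols1, rep(NA, 7), cols2, NA, cols3)
```

Because the result is built in a single pass from one index vector, no intermediate subsets are created and nothing grows step by step. Any later loop of t-tests would need to skip the placeholder columns, since they contain no observations.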
