Read csv with two headers into a data.frame


Problem description

Apologies for the seemingly simple question, but I can't seem to find a solution to the following re-arrangement problem.

I'm used to using read.csv to read in files with a header row, but I have an excel spreadsheet with two 'header' rows - cell identifier (a, b, c ... g) and three sets of measurements (x, y and z; 1000s each) for each cell:

a           b       
x    y  z   x   y   z
10   1  5   22  1   6
12   2  6   21  3   5
12   2  7   11  3   7
13   1  4   33  2   8
12   2  5   44  1   9

csv file below:

a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9

How can I get to a data.frame in R as shown below?

cell x  y   z
a    10 1   5
a    12 2   6
a    12 2   7
a    13 1   4
a    12 2   5
b    22 1   6
b    21 3   5
b    11 3   7
b    33 2   8
b    44 1   9

Solution

Use base R reshape():

temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
#     time  x y z id
# 1.0    0 10 1 5  1
# 2.0    0 12 2 6  2
# 3.0    0 12 2 7  3
# 4.0    0 13 1 4  4
# 5.0    0 12 2 5  5
# 1.1    1 22 1 6  1
# 2.1    1 21 3 5  2
# 3.1    1 11 3 7  3
# 4.1    1 33 2 8  4
# 5.1    1 44 1 9  5

Basically, you should just skip the first row, which holds the letters a-g in every third column. Since the sub-column names are all the same, R will automatically append a grouping number to every column after the third (giving x.1, y.1, z.1 and so on), so we only need to add a grouping number to the first three columns ourselves.
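To see the renaming step in isolation, here is a minimal sketch using the first two data rows from the question. The variable name csv_text is just chosen for this illustration.

```r
# First two data rows of the example file from the question.
csv_text <- "a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5"

# Skip the row of cell labels; the second row becomes the header.
temp <- read.csv(text = csv_text, skip = 1)

# read.csv de-duplicates the repeated names, but only from the
# second occurrence onwards.
names(temp)
# "x" "y" "z" "x.1" "y.1" "z.1"

# Give the first group a ".0" suffix so every column follows the
# "<name>.<group>" pattern that reshape() can split on.
names(temp)[1:3] <- paste0(names(temp)[1:3], ".0")
names(temp)
# "x.0" "y.0" "z.0" "x.1" "y.1" "z.1"
```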

You can either then create an "id" variable, or, as I've done here, just use the row names for the IDs.
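If you would rather have an explicit ID column than rely on row names, something along these lines should work. This is only a sketch: the column name row_id and the helper variable measure_cols are made up for the example, and only two data rows from the question are used.

```r
csv_text <- "a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5"

temp <- read.csv(text = csv_text, skip = 1)
names(temp)[1:3] <- paste0(names(temp)[1:3], ".0")

# Remember which columns hold measurements before adding the ID column.
measure_cols <- seq_len(ncol(temp))

# Hypothetical explicit ID column; "row_id" is a name chosen for this sketch.
temp$row_id <- seq_len(nrow(temp))

# Pass the existing column through idvar instead of supplying ids.
OUT <- reshape(temp, direction = "long", idvar = "row_id",
               varying = measure_cols)
OUT
```

The result has the same long layout as before, except that the ID lives in row_id rather than in a generated id column.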

You can change the "time" variable to your "cell" variable as follows:

# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])

Then, drop the "time" column:

OUT$time = NULL
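
With more than two cells, the number of labels can be derived from the data rather than typed in. The following sketch runs the whole sequence on the example file and ends with the column layout asked for in the question. It assumes the cells really are labelled a, b, c, ... in order; n_groups is a name introduced here.

```r
csv_text <- "a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9"

temp <- read.csv(text = csv_text, skip = 1)
names(temp)[1:3] <- paste0(names(temp)[1:3], ".0")
OUT <- reshape(temp, direction = "long", ids = rownames(temp),
               varying = 1:ncol(temp))

# Count the groups found by reshape() instead of hard-coding 1:2.
n_groups <- length(unique(OUT$time))
OUT$cell <- factor(OUT$time, labels = letters[seq_len(n_groups)])
OUT$time <- NULL

# Keep only the columns from the question, in the requested order,
# and drop the "1.0", "2.0", ... row names.
OUT <- OUT[, c("cell", "x", "y", "z")]
rownames(OUT) <- NULL
OUT
```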

Update

To answer a question in the comments below, if the first label was something other than a letter, this should still pose no problem. The sequence I would take would be as follows:

temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
GROUPS = read.csv("path/to/file.csv", header=FALSE, 
                  nrows=1, stringsAsFactors = FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT$cell = factor(OUT$time, labels=GROUPS)
OUT$time = NULL
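
If you need this for several files, the sequence can be wrapped in a small helper. This is a rough sketch, not a tested general solution: the function name read_two_header_csv and its group_size argument are invented here, and it assumes every cell has the same number of measurement columns. Reading the label row with colClasses = "character" is a small deviation from the code above, so that labels such as "T" are not converted to logical values.

```r
# Hypothetical helper; name and arguments are assumptions for this sketch.
read_two_header_csv <- function(path, group_size = 3) {
  # Row 1 holds the group labels, with blanks in between.
  groups <- read.csv(path, header = FALSE, nrows = 1,
                     colClasses = "character")
  groups <- unlist(groups, use.names = FALSE)
  groups <- groups[!is.na(groups) & groups != ""]

  # Row 2 is the real header; the data starts on row 3.
  temp <- read.csv(path, skip = 1, stringsAsFactors = FALSE)
  first <- seq_len(group_size)
  names(temp)[first] <- paste0(names(temp)[first], ".0")

  out <- reshape(temp, direction = "long", ids = rownames(temp),
                 varying = 1:ncol(temp))
  out$cell <- factor(out$time, labels = groups)
  out$time <- NULL
  out$id <- NULL
  rownames(out) <- NULL

  # Put the cell label first, followed by the measurement columns.
  out[, c("cell", setdiff(names(out), "cell"))]
}

# Usage with a temporary copy of (part of) the example file.
path <- tempfile(fileext = ".csv")
writeLines(c("a,,,b,,",
             "x,y,z,x,y,z",
             "10,1,5,22,1,6",
             "12,2,6,21,3,5"), path)
result <- read_two_header_csv(path)
result
```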
