如何读取变量存储为行的数据集,以及一些变量名称包含“#”的数据集。 [英] How to read a data set where variables are stored as rows, and some variable names contain "#"

查看:158
本文介绍了如何读取变量存储为行的数据集,以及一些变量名称包含“#”的数据集。的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个相当大的(大约1200个变量,每个有14个观察值)数据集,存储在一个文本文件中,非常奇怪,绝对不是整洁结构。实际上,每个变量都存储为一行而不是一列,第一行和第二行分别是变量名和该变量的测量单位。以下是一个示例数据集:

I have a fairly large (about 1200 variables with 14 observations each) data set, stored in a text file, with a very weird and definitely not tidy structure. In practice each variable is stored as a row, instead than as a column, and the first and second row are respectively the variable name, and the measurement unit for that variable. Here is a sample data set:

Date    --- 1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016
PT-AMB#SRV  V   1.403400    1.403207    1.403265    1.403326    1.403454    1.403783    1.404924    1.404962    1.405291    1.404951    1.404685    1.404812    1.404433    1.404428
PS1-SEC20#SRV   V   2.395769    2.416003    2.362276    2.253045    2.139873    1.939328    2.450442    2.294791    2.085946    1.929666    2.634747    3.067008    3.081949    3.095456

第一个变量名为日期并且它是无量纲的(单位 --- ),第二个是 PT-AMB#SRV 并以伏特为单位测量 V ,依此类推。注意:同一行上的两个条目由原始文件中的选项卡分隔。一旦我在Stack Overflow上复制和粘贴数据,我不确定这是否保存。

The first variable is called Date and it's nondimensional (units ---), the second one is PT-AMB#SRV and measured in volts V, and so on. NOTE: two entries on the same row are separated by a tab in the original file. I'm not sure if this is conserved once I copy&paste data here on Stack Overflow.

首先,我尝试读取这样的数据:

First of all, I tried to read in data like this:

df=read.table("TEST.txt",sep="\t")

我收到以下错误:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 2 did not have 16 elements

错误消失如果我手动编辑第二个(和第三个)变量名称,将更改为 -

The error goes away if I manually edit the second (and the third) variable names, changing the # to a -.

第一个问题是:为什么会发生这种情况,如何防止它发生?如果我需要更改所有 - 在所有变量名中,我该如何自动执行?最好在 R 中,否则命令行是很好(我在Windows工作)。

The first question is: why is this happening, and how can I prevent it from happening? If I need to change all # to - in all variable names, how can I do that automatically? Preferably in R, otherwise the command line is fine (I work in Windows).

第二个问题:修改所有后(此示例中只有两个)数据集),我用

Second question: after modifying all the # (just two in this sample data set), I read it with

 df=read.table("TEST.txt",sep="\t")

我得到:

             V1  V2        V3        V4        V5        V6        V7        V8        V9       V10       V11       V12       V13       V14       V15       V16
1          Date --- 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016 1/19/2016
2    PT-AMB-SRV   V  1.403400  1.403207  1.403265  1.403326  1.403454  1.403783  1.404924  1.404962  1.405291  1.404951  1.404685  1.404812  1.404433  1.404428
3 PS1-SEC20-SRV   V  2.395769  2.416003  2.362276  2.253045  2.139873  1.939328  2.450442  2.294791  2.085946  1.929666  2.634747  3.067008  3.081949  3.095456

然后尝试转置 df ,以便变量存储在列中:

I then try to transpose df, so that variables are stored in columns:

df_t=t(df)

我得到:

    [,1]        [,2]         [,3]           
V1  "Date"      "PT-AMB-SRV" "PS1-SEC20-SRV"
V2  "---"       "V"          "V"            
V3  "1/19/2016" "1.403400"   "2.395769"     
V4  "1/19/2016" "1.403207"   "2.416003"     
V5  "1/19/2016" "1.403265"   "2.362276"     
V6  "1/19/2016" "1.403326"   "2.253045"     
V7  "1/19/2016" "1.403454"   "2.139873"     
V8  "1/19/2016" "1.403783"   "1.939328"     
V9  "1/19/2016" "1.404924"   "2.450442"     
V10 "1/19/2016" "1.404962"   "2.294791"     
V11 "1/19/2016" "1.405291"   "2.085946"     
V12 "1/19/2016" "1.404951"   "1.929666"     
V13 "1/19/2016" "1.404685"   "2.634747"     
V14 "1/19/2016" "1.404812"   "3.067008"     
V15 "1/19/2016" "1.404433"   "3.081949"     
V16 "1/19/2016" "1.404428"   "3.095456" 

不再是数据框,而是一个字符数组。绝对不是我想要的。我怎样才能使变量存储在列中(整齐的数据集)?我认为问题是包含测量单位的列,但在转置之前将其删除

No more a data frame, but an array of characters. Definitely not what I want to. How can I make it so that variables are stored in columns (tidy data set)? I thought the problem was the column containing the measurement units, but removing it before transposing with

df[,"V2"]=NULL   

没有解决任何问题。也许 tidyr 可以在这里提供帮助,但我不知道如何。

doesn't solve anything. Maybe tidyr could help here, but I don't know how.

推荐答案

将数据读入DF0,转置它并使用 type.convert 来获取列的相应类。设置名称并使用适当的格式字符串将第一列转换为Date class。

Read the data into DF0, transpose it and use type.convert to get appropriate classes for the columns. Set the names and convert the first column to "Date" class using the appropriate format string.

# replace text = Lines with file = "myfile.dat"
DF0 <- read.table( text = Lines, colClasses = "character", comment = "" )
L <- lapply( as.data.frame( tail( t(DF0), -2 ), stringsAsFactors = FALSE ), type.convert )
DF <- setNames( as.data.frame(L), DF0[[1]] )
DF$Date <- as.Date( DF$Date, format = "%m/%d/%Y" )

结果是:

> DF
         Date PT-AMB#SRV PS1-SEC20#SRV
1  2016-01-19   1.403400      2.395769
2  2016-01-19   1.403207      2.416003
3  2016-01-19   1.403265      2.362276
4  2016-01-19   1.403326      2.253045
5  2016-01-19   1.403454      2.139873
6  2016-01-19   1.403783      1.939328
7  2016-01-19   1.404924      2.450442
8  2016-01-19   1.404962      2.294791
9  2016-01-19   1.405291      2.085946
10 2016-01-19   1.404951      1.929666
11 2016-01-19   1.404685      2.634747
12 2016-01-19   1.404812      3.067008
13 2016-01-19   1.404433      3.081949
14 2016-01-19   1.404428      3.095456

注意:我们使用此输入:

Lines <- 
"Date    --- 1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016   1/19/2016
PT-AMB#SRV  V   1.403400    1.403207    1.403265    1.403326    1.403454    1.403783    1.404924    1.404962    1.405291    1.404951    1.404685    1.404812    1.404433    1.404428
PS1-SEC20#SRV   V   2.395769    2.416003    2.362276    2.253045    2.139873    1.939328    2.450442    2.294791    2.085946    1.929666    2.634747    3.067008    3.081949    3.095456"

这篇关于如何读取变量存储为行的数据集,以及一些变量名称包含“#”的数据集。的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆