How to parse tab-delimited data (of different formats) into a data.table/data.frame?
Question
I am trying to parse tab-delimited data, which has been saved as a text file with extraneous data. I would like this to be an R data.table/data.frame.
The tab-delimited format is the following:
A 1092 - 1093 + 1X
B 1093 HRDCPMRFYT
A 1093 + 1094 - 1X
B 1094 BSZSDFJRVF
A 1094 + 1095 + 1X
B 1095 SSTFCLEPVV
...
There are only two types of rows, A and B. A consistently has 5 columns, e.g. for the first row,
1092 - 1093 + 1X
B consistently has two columns:
1093 HRDCPMRFYT
Question: How do you parse a file with "alternating" rows with different formats?
Let's say that this was a text file containing only this format: alternating rows of A and B, with 5 columns and 2 columns respectively. How do you parse this into an R data.table? My idea would be to create the following format:
1092 - 1093 + 1X 1093 HRDCPMRFYT
1093 + 1094 - 1X 1094 BSZSDFJRVF
1094 + 1095 + 1X 1095 SSTFCLEPVV
...
Recommended answer
One way to go is to read in your data with readLines, pull out the bits you want, and pass them to read.table to form the data frame. So if the rows are alternating, then:
txt <-
'1092 - 1093 + 1X
1093 HRDCPMRFYT
1093 + 1094 - 1X
1094 BSZSDFJRVF
1094 + 1095 + 1X
1095 SSTFCLEPVV'
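# Read the text as a character vector, one element per line
rd <- readLines(textConnection(txt))
# Odd lines are the 5-column rows, even lines the 2-column rows;
# parse each set separately and bind them side by side
data.frame(read.table(text=rd[c(TRUE, FALSE)]),
           read.table(text=rd[c(FALSE, TRUE)]))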
Change textConnection(txt) to your file path.
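The sample in the question carries a leading A or B tag on each row, which the txt example above leaves out. As a minimal sketch, assuming the file really is tab-delimited and every A row is followed by its B row, the rows can be selected by tag instead of by position. The column names (a1 to a5, b1, b2) and the file path in the comment are placeholders chosen for illustration, not names from the question.

```r
# Sample lines in the question's layout (tab-delimited, with the A/B tag).
lines <- c(
  "A\t1092\t-\t1093\t+\t1X",
  "B\t1093\tHRDCPMRFYT",
  "A\t1093\t+\t1094\t-\t1X",
  "B\t1094\tBSZSDFJRVF",
  "A\t1094\t+\t1095\t+\t1X",
  "B\t1095\tSSTFCLEPVV"
)
# For a real file (hypothetical path): lines <- readLines("path/to/file.txt")

# Pick rows by their tag rather than by odd/even position.
is_a <- startsWith(lines, "A\t")
is_b <- startsWith(lines, "B\t")
stopifnot(sum(is_a) == sum(is_b))  # pairing below assumes one B row per A row

# Parse each row type with its own column layout (names are placeholders).
a <- read.table(text = lines[is_a], sep = "\t", stringsAsFactors = FALSE,
                col.names = c("tag_a", paste0("a", 1:5)))
b <- read.table(text = lines[is_b], sep = "\t", stringsAsFactors = FALSE,
                col.names = c("tag_b", "b1", "b2"))

# Drop the tag columns and bind the paired rows side by side.
result <- cbind(a[-1], b[-1])
print(result)
```

If a data.table is wanted rather than a data.frame, the result can be converted afterwards with data.table's as.data.table or setDT.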
Another way is to read the data in only once and then post-process it:
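# fill=TRUE pads the short (2-column) rows; na.strings="" turns the padding into NA
r <- read.table(text=txt, fill=TRUE, stringsAsFactors=FALSE, na.strings = "")
# Put each odd row next to the even row that follows it
d <- cbind(r[c(TRUE, FALSE),], r[c(FALSE, TRUE),])
# Keep only the columns that are not entirely NA
d[ colSums(is.na(d)) < nrow(d)]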
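One detail worth knowing about the single-read approach: read.table infers the number of columns from the first five lines of input unless col.names is given. The variant below is a sketch under that assumption. It fixes the width at five columns explicitly and keeps only the first two fields of the short rows, so no NA-column filtering is needed afterwards. The names V6, V7, and paired are arbitrary labels picked for this example.

```r
# Same sample as above, supplied as a character vector of lines.
txt_lines <- c(
  "1092 - 1093 + 1X",
  "1093 HRDCPMRFYT",
  "1094 + 1095 - 1X",
  "1095 BSZSDFJRVF"
)

# col.names fixes the table at 5 columns regardless of which rows come first.
r <- read.table(text = txt_lines, fill = TRUE, stringsAsFactors = FALSE,
                na.strings = "", col.names = paste0("V", 1:5))

long_rows  <- r[c(TRUE, FALSE), ]     # the 5-column rows
short_rows <- r[c(FALSE, TRUE), 1:2]  # the 2-column rows, padding dropped
names(short_rows) <- c("V6", "V7")    # arbitrary names to avoid duplicates

paired <- cbind(long_rows, short_rows)
rownames(paired) <- NULL
print(paired)
```

Because the two row types share columns in the intermediate table, a column's type is decided by both kinds of row together, so it is worth checking the result with str(paired) on real data.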