使用R导入格式异常的文本数据 [英] Import unusually formatted text data using R
问题描述
我从一台设备上输出了数据.不幸的是,输出数据的组织不是很好,我一直在用R编写代码来分解它.本质上,数据是粘贴到一个长文档中的每个主题的单独信息列表(基本描述信息,以及每个时间间隔的两个不同测量A和B的原始数据).例如:
I have output data from a piece of equipment. Unfortunately the output data is not organized very well, and I have been writing a code in R to break it down. Essentially the data is a separate list of information (basic descriptive information, and raw data for two different measurements A and B for each time interval) for each subject pasted into one long document. For example:
Date: 01/01/2016
Time: 12:00:00
Subject: Subject1
A:
1: 1 2 4 1
2: 2 1 2 3
3: 1 0 2 7
B:
1: 2 3 0 1
2: 4 1 1 2
3: 3 5 2 8
Date: 01/01/2016
Time: 12:00:00
Subject: Subject2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2
3: 3 3 2 4
我已经用R编写了一个有效的代码,但是使用split(seq_along),for循环和do.call(不是很优雅)(主要基于此博客帖子时,请调用而不是循环.
I have written a code in R that works, but is not very elegant, using split(seq_along), for-loops, and do.call (based primarily on this stack overflow question and this blog post).
# First read text file in as a character vector called ‘example’
scan("example_file.txt", what="character", strip.white=T, sep="\n") -> example
# Separate the header text (before the colon) from the proceeding data
# and make that text name the components of the vector
regmatches(example, regexpr(example, pattern="[[:alnum:]]+:", useBytes = F)) -> names(example)
gsub(example, pattern="[[:print:]]+: ", replacement="", useBytes = F)-> example.2
# Then, split character vector into a list based on how many lines are
# dedicated to each subject (in this example, 11 lines); based on SE
# answer cited above
strsplit(example.2, "([A-Z]:)") -> example.3
split(as.list(example.3), ceiling(seq_along(example.2)/11)) -> example.4
# Use a for-loop to systematically add the data together for subjects 1
# and 2 for time interval 1, using the method detailed from a blog post
# (cited above)
my.list <- list()
for(i in 1:2){
strsplit(as.character(example.4[[i]][5]), split="[[:blank:]]+") -> A
strsplit(as.character(example.4[[i]][9]), split="[[:blank:]]+")-> B
as.vector(c(as.character(example.4[[i]][3]), "A", unlist(A))) -> A_char
as.vector(c(as.character(example.4[[i]][3]), "B", unlist(B))) -> B_char
paste(as.character(example.4[[i]][3]), "Measure_A") -> a_name
paste(as.character(example.4[[i]][3]), "Measure_B") -> b_name
my.list[[a_name]] <- A_char
my.list[[b_name]] <- B_char
}
final.data <- do.call(rbind, my.list)
as.data.frame(final.data) -> final.data
names(final.data) <- c("Subject", "Measure", "V1", "V2", "V3", "V4")
我可以使用我的代码(例如,上面的行"1:1 2 4 1"和"1:2 3 0 1")提取所有对象在A和B的单个时间间隔内的数据,并放入将所有信息放到一个数据框中.当我想在所有时间间隔( all )而不是一个时间间隔内执行此操作时,会变得混乱.我不知道如何在没有为每个时间间隔运行单独的for循环的情况下执行此操作.我尝试在for循环内执行for循环,但这没有用.我也不清楚如何使用apply()类型的函数.
I can extract the data for a single time interval for A and B across all subjects using my code (for example, the lines "1: 1 2 4 1" and "1: 2 3 0 1" above) and put put all the information together in a data frame. Where is gets messy is when I want to do this for all of the time intervals, not just one time interval. I can't figure out how to do this without running separate for-loops for each time interval. I tried doing a for-loop within a for-loop, but that didn't work. I also couldn’t figure out how to do this with the apply()-type functions.
按照本例,如果我只有3个时间间隔,这个问题就不会那么严重,但是我的实际数据要长得多.任何更优雅和简洁的方法的建议,将不胜感激!
If I only had 3 time intervals, as per this example, this issue wouldn’t be so bad, but my actual data is a lot longer. Any suggestions for a more elegant and concise approach would be appreciated!
P.S.我知道上述代码给出的最终数据帧具有多余的行名.但是,这是确保最终数据框的主题和度量信息与我应用于早期R对象的标签对齐的一种有用方法.
P.S. I am aware that the final data frame that the above code gives has redundant row names. However, this is a helpful way of making sure that the final data frame’s subject and measure information lines up with the labels I had applied to earlier R objects.
推荐答案
除行名外,此操作不执行任何操作:
This does everything but the rownames:
lines <- readLines(textConnection("Date: 01/01/2016
Time: 12:00:00
Subject: Subject1
A:
1: 1 2 4 1
2: 2 1 2 3
3: 1 0 2 7
B:
1: 2 3 0 1
2: 4 1 1 2
3: 3 5 2 8
Date: 01/01/2016
Time: 12:00:00
Subject: 2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2
3: 3 3 2 4
Date: 01/01/2016
Time: 12:00:00
Subject: 2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2
3: 3 3 2 4
3: 3 3 2 4"))
非基本R解决方案需要一些库:
Some libraries we'll need for the non-base R solution:
library(purrr)
library(tibble)
library(tidyr)
library(dplyr)
修剪空白并过滤出空白行:
Trim whitespace and filter out blank lines:
trimws(lines) %>% discard(`==`, "") -> lines
这是记录开始的行
中索引的向量(通过在行首找到 Date:
来指定):
This makes a vector of the indexes in lines
where the records start (which is designated by finding Date:
at the beginning of a line):
starts <- which(grepl("^Date:", lines))
现在,我们从头开始,寻找 Date:
的 next 出现(即下一条记录).它将找到所有这些,因此我们只关心第一个.为了计算该索引,我们添加起始索引并减去 1
.从理论上讲,只有一个 NA
(即最后一个记录),但是我们懒惰地使用 ifelse
而不是只更改最后一个.
Now, we take those starts and look for the next occurrence of Date:
(i.e. the next record). It's going to find them all, so we only care about the first one. To calculate that index, we add the start index and subtract 1
. In theory there will only be one NA
(i.e. the last record) but we lazily use ifelse
vs just change he last one.
ends <- map_dbl(starts, function(i) {
which(grepl("^Date:", lines[(i+1):length(lines)]))[1]+i-1
})
ends <- ifelse(is.na(ends), length(lines), ends)
因此,现在 starts
包含每个记录的开始索引,而 ends
包含每个记录的结束索引.
So, now starts
contains the indexes of the start of each record and ends
contains the indexes of the ends of each record.
map2_df()
是 mapply()
&的超级方便的伪包装器. do.call(rbind,...)
.我们使用这些格式为DCF格式(键:值
)并使用 read.dcf()
的事实.这样就构成了一个矩阵,然后我们对其进行重新定向并将其转换为data.frame.
The map2_df()
is super handy pseudo-wrapper for mapply()
& do.call(rbind,…)
. We use the fact that these are in DCF format (key: value
) and use read.dcf()
. That makes a matrix and we then re-orient it and turn it into a data.frame.
然后我们将值分开,添加行名以创建 time_interval
列,添加日期,时间和主题,并确保这些列是正确的类型.
We then separate the values, add the row names to make a time_interval
column, add in the date, time and subject and make sure the columns are the right type.
如果我们告诉我们, map2_df()
将使用命名列表键"作为列的事实.
We also use the fact that map2_df()
will use the named list "keys" as a column if we tell it to.
最后,我们对列进行重新排序.
Finally, we reorder the columns.
因此,这将遍历 starts
和 ends
,并将每次迭代传递到 start
和 end
中:
So, this will iterate over starts
and ends
and pass each iteration into start
and end
:
map2_df(starts, ends, function(start, end) {
# now, we extract just the current record into `record` by pulling
# out lines by the indexes.
record <- lines[start:end]
# we then use `read.dcf` to read in the date/subject/time values:
header <- as.data.frame(read.dcf(textConnection(record[1:3])))
# Since we do not have blank lines and you said the records were
# uniform we can use the fact that they'll be at known index
# positions in this `record`. So, we make a list of two vectors
# which are the indexes. Each becomes `i` (two total iterations)
# and we use the value in `i` to extract out the three lines from
# `record` and read those via `read.dcf`.
# But that reads things into a matrix and in an unhelpful order
# so we transpose it into shape and make it a data frame since
# we'll ultimately need that.
# We use `separate` to take the single character space-separated
# `V1` column and turn it into 4 columns. `read.dcf` gave us
# named rows for each time interval so we promote that to a
# full-on column and then add in date/time/subject, ensuring
# they are characters and not factors, then ensure that the
# values we split out from `V1` are numeric and not character or
# factor.
# `map_df` can add in the `A` and `B` from the named list we passed
# in for us and we have it call that column `measure`.
# finally, we put the columns in a better order.
map_df(list(A=5:7, B=9:11), function(i) {
read.dcf(textConnection(record[i])) %>%
t() %>% as_data_frame() %>%
separate(V1, sprintf("V%d", 1:4)) %>%
rownames_to_column("time_interval") %>%
mutate(date=as.character(header$Date),
time=as.character(header$Time),
subject=header$Subject) %>%
mutate_at(vars(starts_with("V")), as.numeric)
}, .id="measure")
}) %>%
select(date, time, subject, measure, time_interval, V1, V2, V3, V4)
这将产生以下输出:
## # A tibble: 18 x 9
## date time subject measure time_interval V1 V2 V3 V4
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01/01/2016 12:00:00 Subject1 A 1 1 2 4 1
## 2 01/01/2016 12:00:00 Subject1 A 2 2 1 2 3
## 3 01/01/2016 12:00:00 Subject1 A 3 1 0 2 7
## 4 01/01/2016 12:00:00 Subject1 B 1 2 3 0 1
## 5 01/01/2016 12:00:00 Subject1 B 2 4 1 1 2
## 6 01/01/2016 12:00:00 Subject1 B 3 3 5 2 8
## 7 01/01/2016 12:00:00 2 A 1 8 2 0 1
## 8 01/01/2016 12:00:00 2 A 2 9 1 2 7
## 9 01/01/2016 12:00:00 2 A 3 1 6 2 7
## 10 01/01/2016 12:00:00 2 B 1 2 3 2 0
## 11 01/01/2016 12:00:00 2 B 2 6 7 1 2
## 12 01/01/2016 12:00:00 2 B 3 3 3 2 4
## 13 01/01/2016 12:00:00 2 A 1 8 2 0 1
## 14 01/01/2016 12:00:00 2 A 2 9 1 2 7
## 15 01/01/2016 12:00:00 2 A 3 1 6 2 7
## 16 01/01/2016 12:00:00 2 B 1 2 3 2 0
## 17 01/01/2016 12:00:00 2 B 2 6 7 1 2
## 18 01/01/2016 12:00:00 2 B 3 3 3 2 4
如果您确实需要基本的R解决方案,那么:
If you really need a base R solution then:
do.call(rbind, mapply(function(start, end) {
record <- lines[start:end]
header <- as.data.frame(read.dcf(textConnection(record[1:3])))
do.call(rbind, lapply(list(A=5:7, B=9:11), function(i) {
mat <- as.data.frame(t(read.dcf(textConnection(record[i]))))
mat <- matrix(unlist(apply(mat, 1, strsplit, split=" "), use.names=FALSE), ncol=4, byrow=TRUE)
mat <- as.data.frame(mat)
mat$time_interval <- 1:3
mat$date <- as.character(header$Date)
mat$time <- as.character(header$Time)
mat$subject <- as.character(header$Subject)
mat
})) -> df
df$measure <- gsub("\\..*$", "", rownames(df))
rownames(df) <- NULL
df
}, starts, ends, SIMPLIFY=FALSE)) -> out_df
out_df[,c("date", "time", "subject", "measure", "time_interval", "V1", "V2", "V3", "V4")]
## date time subject measure time_interval V1 V2 V3 V4
## 1 01/01/2016 12:00:00 Subject1 A 1 1 2 4 1
## 2 01/01/2016 12:00:00 Subject1 A 2 2 1 2 3
## 3 01/01/2016 12:00:00 Subject1 A 3 1 0 2 7
## 4 01/01/2016 12:00:00 Subject1 B 1 1 2 4 1
## 5 01/01/2016 12:00:00 Subject1 B 2 2 1 2 3
## 6 01/01/2016 12:00:00 Subject1 B 3 1 0 2 7
## 7 01/01/2016 12:00:00 2 A 1 8 2 0 1
## 8 01/01/2016 12:00:00 2 A 2 9 1 2 7
## 9 01/01/2016 12:00:00 2 A 3 1 6 2 7
## 10 01/01/2016 12:00:00 2 B 1 8 2 0 1
## 11 01/01/2016 12:00:00 2 B 2 9 1 2 7
## 12 01/01/2016 12:00:00 2 B 3 1 6 2 7
## 13 01/01/2016 12:00:00 2 A 1 8 2 0 1
## 14 01/01/2016 12:00:00 2 A 2 9 1 2 7
## 15 01/01/2016 12:00:00 2 A 3 1 6 2 7
## 16 01/01/2016 12:00:00 2 B 1 8 2 0 1
## 17 01/01/2016 12:00:00 2 B 2 9 1 2 7
## 18 01/01/2016 12:00:00 2 B 3 1 6 2 7
这篇关于使用R导入格式异常的文本数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!