使用R导入格式异常的文本数据 [英] Import unusually formatted text data using R

查看:54
本文介绍了使用R导入格式异常的文本数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我从一台设备上输出了数据.不幸的是,输出数据的组织不是很好,我一直在用R编写代码来分解它.本质上,数据是粘贴到一个长文档中的每个主题的单独信息列表(基本描述信息,以及每个时间间隔的两个不同测量A和B的原始数据).例如:

I have output data from a piece of equipment. Unfortunately the output data is not organized very well, and I have been writing a code in R to break it down. Essentially the data is a separate list of information (basic descriptive information, and raw data for two different measurements A and B for each time interval) for each subject pasted into one long document. For example:

Date: 01/01/2016
Time: 12:00:00 
Subject: Subject1
A: 
1: 1  2  4  1 
2: 2  1  2  3
3: 1  0  2  7
B:
1: 2  3  0  1
2: 4  1  1  2 
3: 3  5  2  8

Date: 01/01/2016
Time: 12:00:00 
Subject: Subject2   
A: 
1: 8  2  0  1 
2: 9  1  2  7
3: 1  6  2  7
B:
1: 2  3  2  0
2: 6  7  1  2
3: 3  3  2  4

我已经用R编写了一个有效的代码,但是使用split(seq_along),for循环和do.call(不是很优雅)(主要基于

I have written a code in R that works, but is not very elegant, using split(seq_along), for-loops, and do.call (based primarily on this stack overflow question and this blog post).

# First read text file in as a character vector called ‘example’

    scan("example_file.txt", what="character", strip.white=T, sep="\n") -> example

# Separate the header text (before the colon) from the proceeding data
# and make that text name the components of the vector

    regmatches(example, regexpr(example, pattern="[[:alnum:]]+:", useBytes = F)) -> names(example)
    gsub(example, pattern="[[:print:]]+: ", replacement="", useBytes = F)-> example.2

# Then, split character vector into a list based on how many lines are
# dedicated to each subject (in this example, 11 lines); based on SE
# answer cited above

    strsplit(example.2, "([A-Z]:)") -> example.3
    split(as.list(example.3), ceiling(seq_along(example.2)/11)) -> example.4

# Use a for-loop to systematically add the data together for subjects 1
# and 2 for time interval 1, using the method detailed from a blog post
# (cited above)

    my.list <- list()

    for(i in 1:2){
            strsplit(as.character(example.4[[i]][5]), split="[[:blank:]]+") -> A
            strsplit(as.character(example.4[[i]][9]), split="[[:blank:]]+")-> B

            as.vector(c(as.character(example.4[[i]][3]), "A", unlist(A))) -> A_char
            as.vector(c(as.character(example.4[[i]][3]), "B", unlist(B))) -> B_char

            paste(as.character(example.4[[i]][3]), "Measure_A") -> a_name
            paste(as.character(example.4[[i]][3]), "Measure_B") -> b_name

            my.list[[a_name]] <- A_char
            my.list[[b_name]] <- B_char
    }

    final.data <- do.call(rbind, my.list)
    as.data.frame(final.data) -> final.data

    names(final.data) <- c("Subject", "Measure", "V1", "V2", "V3", "V4")

我可以使用我的代码(例如,上面的行"1:1 2 4 1"和"1:2 3 0 1")提取所有对象在A和B的单个时间间隔内的数据,并放入将所有信息放到一个数据框中.当我想在所有时间间隔( all )而不是一个时间间隔内执行此操作时,会变得混乱.我不知道如何在没有为每个时间间隔运行单独的for循环的情况下执行此操作.我尝试在for循环内执行for循环,但这没有用.我也不清楚如何使用apply()类型的函数.

I can extract the data for a single time interval for A and B across all subjects using my code (for example, the lines "1: 1 2 4 1" and "1: 2 3 0 1" above) and put put all the information together in a data frame. Where is gets messy is when I want to do this for all of the time intervals, not just one time interval. I can't figure out how to do this without running separate for-loops for each time interval. I tried doing a for-loop within a for-loop, but that didn't work. I also couldn’t figure out how to do this with the apply()-type functions.

按照本例,如果我只有3个时间间隔,这个问题就不会那么严重,但是我的实际数据要长得多.任何更优雅和简洁的方法的建议,将不胜感激!

If I only had 3 time intervals, as per this example, this issue wouldn’t be so bad, but my actual data is a lot longer. Any suggestions for a more elegant and concise approach would be appreciated!

P.S.我知道上述代码给出的最终数据帧具有多余的行名.但是,这是确保最终数据框的主题和度量信息与我应用于早期R对象的标签对齐的一种有用方法.

P.S. I am aware that the final data frame that the above code gives has redundant row names. However, this is a helpful way of making sure that the final data frame’s subject and measure information lines up with the labels I had applied to earlier R objects.

推荐答案

除行名外,此操作不执行任何操作:

This does everything but the rownames:

lines <- readLines(textConnection("Date: 01/01/2016
Time: 12:00:00
Subject: Subject1
A:
1: 1 2 4 1
2: 2 1 2 3
3: 1 0 2 7
B:
1: 2 3 0 1
2: 4 1 1 2
3: 3 5 2 8
Date: 01/01/2016
Time: 12:00:00
Subject: 2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2
3: 3 3 2 4
Date: 01/01/2016
Time: 12:00:00
Subject: 2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2

3: 3 3 2 4
3: 3 3 2 4"))

非基本R解决方案需要一些库:

Some libraries we'll need for the non-base R solution:

library(purrr)
library(tibble)
library(tidyr)
library(dplyr)

修剪空白并过滤出空白行:

Trim whitespace and filter out blank lines:

trimws(lines) %>% discard(`==`, "") -> lines

这是记录开始的中索引的向量(通过在行首找到 Date:来指定):

This makes a vector of the indexes in lines where the records start (which is designated by finding Date: at the beginning of a line):

starts <- which(grepl("^Date:", lines))

现在,我们从头开始,寻找 Date: next 出现(即下一条记录).它将找到所有这些,因此我们只关心第一个.为了计算该索引,我们添加起始索引并减去 1 .从理论上讲,只有一个 NA (即最后一个记录),但是我们懒惰地使用 ifelse 而不是只更改最后一个.

Now, we take those starts and look for the next occurrence of Date: (i.e. the next record). It's going to find them all, so we only care about the first one. To calculate that index, we add the start index and subtract 1. In theory there will only be one NA (i.e. the last record) but we lazily use ifelse vs just change he last one.

ends <- map_dbl(starts, function(i) {
  which(grepl("^Date:", lines[(i+1):length(lines)]))[1]+i-1
})
ends <- ifelse(is.na(ends), length(lines), ends)

因此,现在 starts 包含每个记录的开始索引,而 ends 包含每个记录的结束索引.

So, now starts contains the indexes of the start of each record and ends contains the indexes of the ends of each record.

map2_df() mapply()&的超级方便的伪包装器. do.call(rbind,...).我们使用这些格式为DCF格式(键:值)并使用 read.dcf()的事实.这样就构成了一个矩阵,然后我们对其进行重新定向并将其转换为data.frame.

The map2_df() is super handy pseudo-wrapper for mapply() & do.call(rbind,…). We use the fact that these are in DCF format (key: value) and use read.dcf(). That makes a matrix and we then re-orient it and turn it into a data.frame.

然后我们将值分开,添加行名以创建 time_interval 列,添加日期,时间和主题,并确保这些列是正确的类型.

We then separate the values, add the row names to make a time_interval column, add in the date, time and subject and make sure the columns are the right type.

如果我们告诉我们, map2_df()将使用命名列表键"作为列的事实.

We also use the fact that map2_df() will use the named list "keys" as a column if we tell it to.

最后,我们对列进行重新排序.

Finally, we reorder the columns.

因此,这将遍历 starts ends ,并将每次迭代传递到 start end 中:

So, this will iterate over starts and ends and pass each iteration into start and end:

map2_df(starts, ends, function(start, end) {

  # now, we extract just the current record into `record` by pulling
  # out lines by the indexes.

  record <- lines[start:end]

  # we then use `read.dcf` to read in the date/subject/time values:

  header <- as.data.frame(read.dcf(textConnection(record[1:3])))

  # Since we do not have blank lines and you said the records were
  # uniform we can use the fact that they'll be at known index
  # positions in this `record`. So, we make a list of two vectors
  # which are the indexes. Each becomes `i` (two total iterations)
  # and we use the value in `i` to extract out the three lines from
  # `record` and read those via `read.dcf`.

  # But that reads things into a matrix and in an unhelpful order
  # so we transpose it into shape and make it a data frame since
  # we'll ultimately need that.

  # We use `separate` to take the single character space-separated
  # `V1` column and turn it into 4 columns. `read.dcf` gave us
  # named rows for each time interval so we promote that to a 
  # full-on column and then add in date/time/subject, ensuring
  # they are characters and not factors, then ensure that the 
  # values we split out from `V1` are numeric and not character or
  # factor.

  # `map_df` can add in the `A` and `B` from the named list we passed
  # in for us and we have it call that column `measure`.

  # finally, we put the columns in a better order.

  map_df(list(A=5:7, B=9:11), function(i) {
    read.dcf(textConnection(record[i])) %>%  
      t() %>% as_data_frame() %>%
      separate(V1, sprintf("V%d", 1:4)) %>%
      rownames_to_column("time_interval") %>%
      mutate(date=as.character(header$Date),
             time=as.character(header$Time),
             subject=header$Subject) %>%
      mutate_at(vars(starts_with("V")), as.numeric)

  }, .id="measure")

}) %>% 
  select(date, time, subject, measure, time_interval, V1, V2, V3, V4)

这将产生以下输出:

## # A tibble: 18 x 9
##          date     time  subject measure time_interval    V1    V2    V3    V4
##         <chr>    <chr>    <chr>   <chr>         <chr> <dbl> <dbl> <dbl> <dbl>
## 1  01/01/2016 12:00:00 Subject1       A             1     1     2     4     1
## 2  01/01/2016 12:00:00 Subject1       A             2     2     1     2     3
## 3  01/01/2016 12:00:00 Subject1       A             3     1     0     2     7
## 4  01/01/2016 12:00:00 Subject1       B             1     2     3     0     1
## 5  01/01/2016 12:00:00 Subject1       B             2     4     1     1     2
## 6  01/01/2016 12:00:00 Subject1       B             3     3     5     2     8
## 7  01/01/2016 12:00:00        2       A             1     8     2     0     1
## 8  01/01/2016 12:00:00        2       A             2     9     1     2     7
## 9  01/01/2016 12:00:00        2       A             3     1     6     2     7
## 10 01/01/2016 12:00:00        2       B             1     2     3     2     0
## 11 01/01/2016 12:00:00        2       B             2     6     7     1     2
## 12 01/01/2016 12:00:00        2       B             3     3     3     2     4
## 13 01/01/2016 12:00:00        2       A             1     8     2     0     1
## 14 01/01/2016 12:00:00        2       A             2     9     1     2     7
## 15 01/01/2016 12:00:00        2       A             3     1     6     2     7
## 16 01/01/2016 12:00:00        2       B             1     2     3     2     0
## 17 01/01/2016 12:00:00        2       B             2     6     7     1     2
## 18 01/01/2016 12:00:00        2       B             3     3     3     2     4

如果您确实需要基本的R解决方案,那么:

If you really need a base R solution then:

do.call(rbind, mapply(function(start, end) {

  record <- lines[start:end]
  header <- as.data.frame(read.dcf(textConnection(record[1:3])))

  do.call(rbind, lapply(list(A=5:7, B=9:11), function(i) {
    mat <- as.data.frame(t(read.dcf(textConnection(record[i]))))
    mat <- matrix(unlist(apply(mat, 1, strsplit, split=" "), use.names=FALSE), ncol=4, byrow=TRUE)
    mat <- as.data.frame(mat)
    mat$time_interval <- 1:3
    mat$date <- as.character(header$Date)
    mat$time <- as.character(header$Time)
    mat$subject <- as.character(header$Subject)
    mat
  })) -> df

  df$measure <- gsub("\\..*$", "", rownames(df))
  rownames(df) <- NULL
  df

}, starts, ends, SIMPLIFY=FALSE)) -> out_df
out_df[,c("date", "time", "subject", "measure", "time_interval", "V1", "V2", "V3", "V4")]

##          date     time  subject measure time_interval V1 V2 V3 V4
## 1  01/01/2016 12:00:00 Subject1       A             1  1  2  4  1
## 2  01/01/2016 12:00:00 Subject1       A             2  2  1  2  3
## 3  01/01/2016 12:00:00 Subject1       A             3  1  0  2  7
## 4  01/01/2016 12:00:00 Subject1       B             1  1  2  4  1
## 5  01/01/2016 12:00:00 Subject1       B             2  2  1  2  3
## 6  01/01/2016 12:00:00 Subject1       B             3  1  0  2  7
## 7  01/01/2016 12:00:00        2       A             1  8  2  0  1
## 8  01/01/2016 12:00:00        2       A             2  9  1  2  7
## 9  01/01/2016 12:00:00        2       A             3  1  6  2  7
## 10 01/01/2016 12:00:00        2       B             1  8  2  0  1
## 11 01/01/2016 12:00:00        2       B             2  9  1  2  7
## 12 01/01/2016 12:00:00        2       B             3  1  6  2  7
## 13 01/01/2016 12:00:00        2       A             1  8  2  0  1
## 14 01/01/2016 12:00:00        2       A             2  9  1  2  7
## 15 01/01/2016 12:00:00        2       A             3  1  6  2  7
## 16 01/01/2016 12:00:00        2       B             1  8  2  0  1
## 17 01/01/2016 12:00:00        2       B             2  9  1  2  7
## 18 01/01/2016 12:00:00        2       B             3  1  6  2  7

这篇关于使用R导入格式异常的文本数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆