Rearranging the structure of many txt files and then merging them in one data frame
Question
Thanks a lot for your help!
I have ~4.5k txt files which look like this:
Simple statistics using MSPA parameters: 8_3_1_1 on input file: 20130815 104359 875 000000 0528 0548_result.tif
MSPA-class [color]:    Foreground/data pixels [%]    Frequency
============================================================
CORE(s) [green]:       --             0
CORE(m) [green]:       48.43/13.45    1
CORE(l) [green]:       --             0
ISLET [brown]:         3.70/ 1.03     20
PERFORATION [blue]:    0.00/ 0.00     0
EDGE [black]:          30.93/ 8.59    11
LOOP [yellow]:         9.66/ 2.68     6
BRIDGE [red]:          0.00/ 0.00     0
BRANCH [orange]:       7.28/ 2.02     40
Background [grey]:     --- /72.22     11
Missing [white]:       0.00           0
I want to read all txt files from a directory into R and then perform a rearranging task on them before merging them together.
The values in the txt files can change, so where there is a 0.00 now, some files may have a relevant number (so we need those). For the fields that currently contain --, it would be good if the script could test whether there is -- or a number. If there is --, it should be turned into NA. On the other hand, real 0.00 values are meaningful and I need them. There is only one value for the Missing [white] row; that value should be copied into both columns, foreground % and data pixels %.
The general rearranging which I need is to make all the data available as columns with only 1 row per txt file. For every row of data in the txt file here, there should be 3 columns in the output file (foreground%, data pixel% and frequency for every color). The name of the row should be the image name which is mentioned in the beginning of the file, here: 20130815 104359 875 000000 0528 0548
The rest can be omitted.
The output should look like this:
I am working on this simultaneously but am not sure which direction to take. So any help is more than welcome!
Best, Moritz
Recommended answer
This puts it in the format you want, I think, but the example doesn't match your image so I can't be sure:
(lf <- list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE))
# [1] "/Users/rawr/desktop/image001.txt" "/Users/rawr/desktop/image002.txt"
# [3] "/Users/rawr/desktop/image003.txt"
lapply(lf, function(xx) {
  rl <- readLines(con <- file(xx), warn = FALSE)
  close(con)
  ## assuming the file name is after "file: " until the end of the string
  ## and ends in .tif
  img_name <- gsub('.*file:\\s+(.*)\\.tif', '\\1', rl[1])
  ## removes each string up to and including the ===== string
  rl <- rl[-(1:grep('==', rl))]
  ## remove leading whitespace
  rl <- gsub('^\\s+', '', rl)
  ## split the remaining lines by larger chunks of whitespace
  mat <- do.call('rbind', strsplit(rl, '\\s{2,}'))
  ## more cleaning, setting attributes, etc
  mat[mat == '--'] <- NA
  mat <- cbind(image_name = img_name, `colnames<-`(t(mat[, 2]), mat[, 1]))
  as.data.frame(mat)
})
I created three files using your example and made each one slightly different to show how this would work on a directory with several files:
# [[1]]
# image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]: PERFORATION [blue]: EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
# 1 20130815 104359 875 000000 0528 0548_result <NA> 48.43/13.45 <NA> 3.70/ 1.03 0.00/ 0.00 30.93/ 8.59 9.66/ 2.68 0.00/ 0.00 7.28/ 2.02 --- /72.22 0.00
#
# [[2]]
# image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]: PERFORATION [blue]: EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
# 1 20139341 104359 875 000000 0528 0548_result 23 48.43/13.45 23 <NA> 0.00/ 0.00 30.93/ 8.59 9.66/ 2.68 0.00/ 0.00 7.28/ 2.02 --- /72.22 0.00
#
# [[3]]
# image_name CORE(s) [green]: CORE(m) [green]: CORE(l) [green]: ISLET [brown]: PERFORATION [blue]: EDGE [black]: LOOP [yellow]: BRIDGE [red]: BRANCH [orange]: Background [grey]: Missing [white]:
# 1 20132343 104359 875 000000 0528 0548_result <NA> <NA> <NA> <NA> <NA> 30.93/ 8.59 9.66/ 2.68 0.00/ 0.00 7.28/ 2.02 <NA> 0.00
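The question asks for the row name without the `_result` suffix, while the output above keeps it. One possible adjustment, offered only as a sketch, is to make the suffix optional in the pattern. The `header` string is copied from the sample file; the lazy `(.*?)` group and the optional `(_result)?` group are my own assumption about how the names vary, not part of the original answer:

```r
# First line of the sample file from the question.
header <- paste("Simple statistics using MSPA parameters: 8_3_1_1 on input file:",
                "20130815 104359 875 000000 0528 0548_result.tif")

# Capture everything after "file: " up to an optional "_result" and the
# ".tif" extension; perl = TRUE is needed for the lazy quantifier.
img_name <- sub(".*file:\\s+(.*?)(_result)?\\.tif$", "\\1", header, perl = TRUE)
img_name
```

Because `_result` is optional, the same pattern should also work for header lines that end directly in `.tif`.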
Edit
I made a few changes to extract all the info:
(lf <- list.files('~/desktop', pattern = '^image\\d+\\.txt$', full.names = TRUE))
# [1] "/Users/rawr/desktop/image001.txt" "/Users/rawr/desktop/image002.txt"
# [3] "/Users/rawr/desktop/image003.txt"
res <- lapply(lf, function(xx) {
  rl <- readLines(con <- file(xx), warn = FALSE)
  close(con)
  ## image name: everything after "file: " up to the .tif extension
  img_name <- gsub('.*file:\\s+(.*)\\.tif', '\\1', rl[1])
  ## drop the header lines, up to and including the ===== line
  rl <- rl[-(1:grep('==', rl))]
  rl <- gsub('^\\s+', '', rl)
  ## split each line into class, values and frequency
  mat <- do.call('rbind', strsplit(rl, '\\s{2,}'))
  dat <- as.data.frame(mat, stringsAsFactors = FALSE)
  ## split the values into foreground and data pixel percentages
  tmp <- `colnames<-`(do.call('rbind', strsplit(dat$V2, '[-\\/\\s]+', perl = TRUE)),
                      c('Foreground','Data pixels'))
  dat <- cbind(dat[, -2], tmp, image_name = img_name)
  dat[] <- lapply(dat, as.character)
  ## dashed-out fields end up as empty strings; mark them as missing
  dat[dat == ''] <- NA
  names(dat)[1:2] <- c('MSPA-class','Frequency')
  ## one row per image, one column per class/measure combination
  zzz <- reshape(dat, direction = 'wide', idvar = 'image_name', timevar = 'MSPA-class')
  names(zzz)[-1] <- gsub('(.*)\\.(.*) (?:.*)', '\\2_\\1', names(zzz)[-1], perl = TRUE)
  zzz
})
Here is the result. I transformed it into a long matrix so it is easier to read; the real results are very wide data frames, one for each file:
`rownames<-`(matrix(res[[1]]), names(res[[1]]))
# [,1]
# image_name "20130815 104359 875 000000 0528 0548_result"
# CORE(s)_Frequency "0"
# CORE(s)_Foreground "NA"
# CORE(s)_Data pixels "NA"
# CORE(m)_Frequency "1"
# CORE(m)_Foreground "48.43"
# CORE(m)_Data pixels "13.45"
# CORE(l)_Frequency "0"
# CORE(l)_Foreground "NA"
# CORE(l)_Data pixels "NA"
# ISLET_Frequency "20"
# ISLET_Foreground "3.70"
# ISLET_Data pixels "1.03"
# PERFORATION_Frequency "0"
# PERFORATION_Foreground "0.00"
# PERFORATION_Data pixels "0.00"
# EDGE_Frequency "11"
# EDGE_Foreground "30.93"
# EDGE_Data pixels "8.59"
# LOOP_Frequency "6"
# LOOP_Foreground "9.66"
# LOOP_Data pixels "2.68"
# BRIDGE_Frequency "0"
# BRIDGE_Foreground "0.00"
# BRIDGE_Data pixels "0.00"
# BRANCH_Frequency "40"
# BRANCH_Foreground "7.28"
# BRANCH_Data pixels "2.02"
# Background_Frequency "11"
# Background_Foreground "NA"
# Background_Data pixels "72.22"
# Missing_Frequency "0"
# Missing_Foreground "0.00"
# Missing_Data pixels "0.00"
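The step that separates the two percentages is the `strsplit()` call on the value column. As a minimal, self-contained sketch (the vector `vals` is made up from the four kinds of entries in the sample file, and `split_vals` is just an illustrative name), this shows how the dashes, the slash, and the single Missing value are handled:

```r
# Four kinds of value fields seen in the sample file (illustrative input).
vals <- c("48.43/13.45", "--", "--- /72.22", "0.00")

# Split on any run of dashes, slashes or whitespace.
parts <- strsplit(vals, "[-/\\s]+", perl = TRUE)

# rbind() recycles length-one pieces, so "--" becomes two empty strings
# and the single Missing value is copied into both columns.
split_vals <- do.call(rbind, parts)
colnames(split_vals) <- c("Foreground", "Data pixels")

# Empty strings mark the dashed-out fields; turn them into NA.
split_vals[split_vals == ""] <- NA
split_vals
```

Real `0.00` entries contain no separator, so they pass through unchanged and are never confused with the dashed-out fields.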
With your example data:
lf <- list.files('~/desktop/data', pattern = '.txt', full.names = TRUE)
`rownames<-`(matrix(res[[1]]), names(res[[1]]))
# [,1]
# image_name "20130815 103704 780 000000 0372 0616"
# CORE(s)_Frequency "0"
# CORE(s)_Foreground "NA"
# CORE(s)_Data pixels "NA"
# CORE(m)_Frequency "1"
# CORE(m)_Foreground "54.18"
# CORE(m)_Data pixels "15.16"
# CORE(l)_Frequency "0"
# CORE(l)_Foreground "NA"
# CORE(l)_Data pixels "NA"
# ISLET_Frequency "11"
# ISLET_Foreground "3.14"
# ISLET_Data pixels "0.88"
# PERFORATION_Frequency "0"
# PERFORATION_Foreground "0.00"
# PERFORATION_Data pixels "0.00"
# EDGE_Frequency "1"
# EDGE_Foreground "34.82"
# EDGE_Data pixels "9.75"
# LOOP_Frequency "1"
# LOOP_Foreground "4.96"
# LOOP_Data pixels "1.39"
# BRIDGE_Frequency "0"
# BRIDGE_Foreground "0.00"
# BRIDGE_Data pixels "0.00"
# BRANCH_Frequency "20"
# BRANCH_Foreground "2.89"
# BRANCH_Data pixels "0.81"
# Background_Frequency "1"
# Background_Foreground "NA"
# Background_Data pixels "72.01"
# Missing_Frequency "0"
# Missing_Foreground "0.00"
# Missing_Data pixels "0.00"
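The question asks for a single data frame with one row per file, while `res` is a list of one-row data frames. A possible final step, sketched here under the assumptions that every file yields the same column names and that image names are unique, is to stack the list with `rbind()` and convert the value columns to numeric. The two small data frames in `res_demo` are made-up stand-ins for the real `res`, and `merged` is an illustrative name:

```r
# Made-up stand-ins for two elements of `res` (columns shortened for brevity).
res_demo <- list(
  data.frame(image_name = "img_a", ISLET_Frequency = "20",
             ISLET_Foreground = "3.70", stringsAsFactors = FALSE),
  data.frame(image_name = "img_b", ISLET_Frequency = "11",
             ISLET_Foreground = NA_character_, stringsAsFactors = FALSE)
)

# Stack the one-row data frames; rbind() requires matching column names.
merged <- do.call(rbind, res_demo)

# Everything except the image name was read as text; make it numeric.
merged[-1] <- lapply(merged[-1], as.numeric)

# Use the image name as the row name, as requested in the question.
rownames(merged) <- merged$image_name
merged
```

With the real data, the same three steps would be applied to `res` instead of `res_demo`.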