How to import and sort a poorly formed stacked CSV file in R

Views: 160 | Posted: 2018/2/3 18:14:13 | Tags: r, csv, import, format, terminology


Question


How can I import and sort this data (following code section) to be readily manipulated by R?

Are the organ names, dose unit 'Gy', volume unit 'CC' all three considered 'factors' by R? What is the terminology for the data set name and data variables?

These histograms place one data set sequentially after the other as follows:

Example Data File:

    Bladder,,
    GY, (CC),
    0.0910151,1.34265
    0.203907,1.55719
    [skipping to end of this data set]
    57.6659,0.705927
    57.7787,0.196091
    ,,
    CTV-operator,,
    GY, (CC),
    39.2238,0.00230695
    39.233,0
    [repeating for remainder of data sets; skipping to end of file]
    53.1489,0
    53.2009,0.0161487
    ,,
    [blank line]

Data set labels (e.g. Bladder, CTV-operator, Rectum) are sometimes lowercase, and generally in a random order within the file. I have dozens of files categorized in two folders to import and analyze as one large patient sample.
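
On the terminology: in R the whole table is usually called a data frame, its columns are variables, its rows are observations, and a categorical variable such as the organ label is typically stored as a factor whose distinct values are its levels. The following is a minimal sketch of that layout, not a definitive answer. The values are typed in by hand from the example file, and the lowercase "ctv-operator" row is an assumption added to illustrate the mixed-case labels mentioned above.

```r
# Illustrative values only, typed in by hand from the example file.
# The lowercase "ctv-operator" label is an assumed example of mixed case.
dvh <- data.frame(
  organ  = c("Bladder", "Bladder", "CTV-operator", "ctv-operator"),
  dosage = c(0.0910151, 0.203907, 39.2238, 39.233),
  volume = c(1.34265, 1.55719, 0.00230695, 0),
  stringsAsFactors = FALSE
)

# Normalise case before converting, so "CTV-operator" and "ctv-operator"
# collapse into a single factor level
dvh$organ <- factor(tolower(dvh$organ))

# The distinct organ labels are the levels of the factor
organ_levels <- levels(dvh$organ)

# Measurements stay numeric; only the grouping variable is a factor
dose_is_numeric <- is.numeric(dvh$dosage)

# A factor makes grouped summaries straightforward,
# here the maximum dose recorded for each organ
max_dose <- tapply(dvh$dosage, dvh$organ, max)
```

The units ("Gy", "CC") are constant within a column, so they are arguably metadata rather than factors. Keeping them as plain character columns or in the column names is a design choice.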

I have started this script, but I suspect there is a better way:

    [file = file.path()]
    DVH = read.csv(file, header = FALSE, sep = ",", fill = TRUE)

    DVH[3] <- NULL                # delete last column from data
    loop = 1; notover = TRUE
    factor(DVH[loop,1])           # Store the first element as a factor
    while(notover)
    {loop = loop + 1              # move to next line
     DVH$1<-factor(DVH[loop,1])   # I must change ...
     DVH$2<-factor(DVH[loop,2])   # ... these lines.
     if([condition indicating end of file; code to be learned]) {notover = FALSE}
    }
    # store first element as data label
    # store next element as data label
    # store data for (observations given) this factor
    # if line is blank, move to next line, store first element as new factor, and repeat until end of file

Walter Roberson helped me prepare this code to import and parse the data in MATLAB, and so far I have more or less been trying to do the same thing in R:

    for fileloop = 1:length(indexnumber)
        num = 0;
        fid = fopen(['filepath to folder',num2str(indexnumber(fileloop)),'.csv'],'rt');
        while true
            H1 = fgetl(fid) ;
            if feof(fid); break; end
            H2 = fgetl(fid) ;
            if feof(fid); break; end
            datacell = textscan(fid, '%f%f', 'delimiter', ',', 'collectoutput', true) ;
            if isempty(datacell) || isempty(datacell{1}); break; end
            if any(isnan(datacell{1}(end,:))); datacell{1}(end,:) = []; end
            num = num + 1;
            headers(num,:) = {H1, H2} ;
            data(num) = datacell;
        end
        fclose(fid);
        clear datacell H1 H2

Additional Info:

I am new to R with intermediate MATLAB experience. I am switching from MATLAB to R so that my work may be more readily reproducible by others worldwide. (R is free; MATLAB is not.)
This data is from exporting dose-volume histograms from radiation oncology software Velocity for cancer therapy research.

(I asked this question previously for Python but a computer scientist recommended I use R instead.)

Thank you for your time.
Solution
Here is an alternative that should run much faster than processing the file line by line in a for loop. It first reads the entire data file into a single-column data frame, then cleans up the data with whole-column operations.

    # Load required library
    library(tidyr)

    # Create function to process file
    process.file <- function(path){

      # Import data into a single column dataframe
      df <- as.data.frame(scan(path, character(), sep = "\n", quiet = TRUE), stringsAsFactors = FALSE)

      # Set column name
      colnames(df) <- "col1"

      # Copy organ names to new column
      df$organ <- sapply(df[,1], function(x) ifelse(regmatches(x, regexpr(".{2}$", x)) == ",,", gsub('.{2}$', '', x), NA))

      # Fill organ name for all rows
      df <- fill(df, organ, .direction = "down")

      # Remove the rows that contained the organ
      df <- df[regmatches(df[,1], regexpr(".{2}$", df[,1])) != ",,", ]

      # Copy units into a new column
      df$units <- sapply(df[,1], function(x) ifelse(regmatches(x, regexpr(".{1}$", x)) == ",", gsub('.{1}$', '', x), NA))

      # Fill units field for all rows
      df <- fill(df, units, .direction = "down")

      # Separate units into dose.unit and vol.unit columns
      df <- separate(df, units, c("dose.unit","vol.unit"), ", ")

      # Remove the rows that contained the units
      df <- df[regmatches(df[,1], regexpr(".{1}$", df[,1])) != ",", ]

      # Separate the remaining data into dosage and volume columns
      df <- separate(df, col1, c("dosage","volume"), ",")

      # Set data type of dosage and volume to numeric
      df[,c("dosage","volume")] <- lapply(df[,c("dosage","volume")], as.numeric)

      # Reorder columns
      df <- df[, c("organ","dosage","dose.unit","volume","vol.unit")]

      # Return the dataframe
      return(df)
    }

    # Set path to root folder directory
    source.dir <- "path/to/root/folder"  # placeholder: put the path to the root folder here

    # Retrieve all files from folder
    # NOTE: To retrieve all files from the folder and all of its subfolders, set: recursive = TRUE
    # NOTE: To only include files with certain words in the name, include: pattern = "your.pattern.here"
    files <- list.files(source.dir, recursive = FALSE, full.names = TRUE)

    # Process each file and store dataframes in list
    ldf <- lapply(files, process.file)

    # Combine all dataframes to a single dataframe
    final.df <- do.call(rbind, ldf)

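
For comparison, the same "read everything, then clean up" idea can be sketched in base R without tidyr. This is a rough sketch under stated assumptions rather than a drop-in replacement for the function above. `parse_dvh_lines` is a hypothetical helper name. It assumes the file starts with an organ header row, that every organ header is followed by exactly one unit row, and that real files do not contain the bracketed placeholder lines shown in the question.

```r
# Hypothetical helper: parse the lines of one stacked export into a data frame.
# Assumes every block is: organ header, one unit row, then data rows.
parse_dvh_lines <- function(lines) {
  lines <- trimws(lines)
  # Drop blank lines and the ",," separator rows between data sets
  lines <- lines[lines != "" & lines != ",,"]

  # Organ header rows end in two commas, e.g. "Bladder,,"
  is_organ <- grepl(",,$", lines)
  # Unit rows end in a single comma, e.g. "GY, (CC),"
  is_unit <- grepl(",$", lines) & !is_organ
  is_data <- !is_organ & !is_unit

  # Every organ header starts a new block, so a running count gives block ids
  block <- cumsum(is_organ)[is_data]

  organs <- sub(",,$", "", lines[is_organ])
  units  <- strsplit(sub(",$", "", lines[is_unit]), ",\\s*")
  values <- strsplit(lines[is_data], ",", fixed = TRUE)

  data.frame(
    organ     = organs[block],
    dosage    = as.numeric(vapply(values, `[`, character(1), 1)),
    dose.unit = vapply(units, `[`, character(1), 1)[block],
    volume    = as.numeric(vapply(values, `[`, character(1), 2)),
    vol.unit  = vapply(units, `[`, character(1), 2)[block],
    stringsAsFactors = FALSE
  )
}

# A shortened copy of the example file from the question
sample_lines <- c(
  "Bladder,,", "GY, (CC),",
  "0.0910151,1.34265", "0.203907,1.55719",
  ",,",
  "CTV-operator,,", "GY, (CC),",
  "39.2238,0.00230695", "39.233,0",
  ",,", ""
)

dvh <- parse_dvh_lines(sample_lines)
```

For a file on disk, the lines could come from `readLines(path)`, and several files can be combined with `do.call(rbind, lapply(files, function(f) parse_dvh_lines(readLines(f))))`.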
