Looping through .dat files in R and extracting only specific data as columns


Question

I have 900+ folders in my local drive and each folder has a single .dat extension file. I want to loop through each folder to access the file in it to fetch only specific data and write that data in a new file. Each .dat file looks something like this -

Authors:
#    Pallavi Subhraveti
#    Quang Ong
#    Tim Holland
#    Anamika Kothari
#    Ingrid Keseler 
#    Ron Caspi
#    Peter D Karp

# Please see the license agreement regarding the use of and distribution of this file.
# The format of this file is defined at http://bioinformatics.ai.sri.com
# Version: 21.5
# File Name: compounds.dat
# Date and time generated: October 24, 2017, 14:52:45

# Attributes:
#    UNIQUE-ID
#    TYPES
#    COMMON-NAME
#    ABBREV-NAME
#    ACCESSION-1
#    ANTICODON
#    ATOM-CHARGES
#    ATOM-ISOTOPES
#    CATALYZES
#    CFG-ICON-COLOR
#    CHEMICAL-FORMULA
#    CITATIONS
#    CODONS
#    COFACTORS-OF
#    MOLECULAR-WEIGHT
#    MONOISOTOPIC-MW

[Data Chunk 1]
UNIQUE-ID - CPD0-1108
TYPES - D-Ribofuranose
COMMON-NAME - β-D-ribofuranose
ATOM-CHARGES - (9 -1)
ATOM-CHARGES - (6 1)
CHEMICAL-FORMULA - (C 5)
CHEMICAL-FORMULA - (H 14)
CHEMICAL-FORMULA - (N 1)
CHEMICAL-FORMULA - (O 6)
CHEMICAL-FORMULA - (P 1)
CREDITS - SRI
CREDITS - kaipa
DBLINKS - (CHEBI "10647" NIL |kothari| 3594051403 NIL NIL)
DBLINKS - (BIGG "37147" NIL |kothari| 3584718837 NIL NIL)
DBLINKS - (PUBCHEM "25200464" NIL |taltman| 3466375284 NIL NIL)
DBLINKS - (LIGAND-CPD "C01233" NIL |keseler| 3342798255 NIL NIL)
INCHI - InChI=1S/C5H14NO6P/c6-1-2-11-13(9,10)12-4-5(8)3-7/h5,7-8H,1-4,6H2,(H,9,10)
MOLECULAR-WEIGHT - 215.142    
MONOISOTOPIC-MW - 216.0636987293    
NON-STANDARD-INCHI - InChI=1S/C5H14NO6P/c6-1-2-11-13(9,10)12-4-5(8)3-7/h5,7-8H,1-4,6H2,(H,9,10)
SMILES - C(OP([O-])(OCC(CO)O)=O)C[N+]
SYNONYMS - sn-Glycero-3-phosphoethanolamine
SYNONYMS - 1-glycerophosphorylethanolamine\
[Data Chunk 2]
//
UNIQUE-ID - URIDINE
TYPES - Pyrimidine
....
....

Each file has approximately 18000 lines in it (looking at the data in Notepad++). Now I want to create a new file and copy only specific columns from the data. I want only these columns to be copied in my newly created file and the file should look like this -

UNIQUE-ID  TYPES           COMMON-NAME       CHEMICAL-FORMULA  BIGG ID  CHEMSPIDER ID  CAS ID  CHEBI ID  PUBCHEM ID  MOLECULAR-WEIGHT  MONOISOTOPIC-MW
CPD0-1108  D-Ribofuranose  β-D-ribofuranose  C5H14N1O6P1       37147    NA             NA      10647     25200464    215.142           216.0636987293
URIDINE    Pyrimidine      ...

Every chunk of data in each file doesn't necessarily have information for all the columns I need which is why I have mentioned NA for those columns in the output table I want. Although it's completely fine if I get blank values in those columns as I can deal with those blanks later on separately.

These are the directories that contain the data -

File 1] -> C:\Users\robbie\Desktop\Organism_Data\aact1035194-hmpcyc\compounds.dat
File 2] -> C:\Users\robbie\Desktop\Organism_Data\aaph679198-hmpcyc\compounds.dat
File 3] -> C:\Users\robbie\Desktop\Organism_Data\yreg1002368-hmpcyc\compounds.dat
File 4] -> C:\Users\robbie\Desktop\Organism_Data\tden699187-hmpcyc\compounds.dat
...
...

I was really inclined towards using the dir function in R, referring to this post, but I got confused about what to put in the pattern parameter of the function while writing the code, as the organism names (folder names) are pretty weird and not consistent.
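
On the pattern question: the `pattern` argument of `dir()` / `list.files()` is matched against file names only, so the inconsistent folder names do not need to appear in it. A minimal sketch in base R, assuming a throwaway demo tree under `tempdir()` (the `Organism_Data_demo` folder and its contents are made up here to stand in for the real `Organism_Data` directory):

```r
# Hypothetical demo tree standing in for Organism_Data; folder names are arbitrary.
root <- file.path(tempdir(), "Organism_Data_demo")
for (org in c("aact1035194-hmpcyc", "yreg1002368-hmpcyc")) {
  dir.create(file.path(root, org), recursive = TRUE, showWarnings = FALSE)
  writeLines("UNIQUE-ID - CPD0-1108", file.path(root, org, "compounds.dat"))
}

# The regex is applied to file names, so only the ".dat" ending has to match;
# recursive = TRUE descends into every subfolder regardless of its name.
dat_files <- list.files(root, pattern = "\\.dat$", recursive = TRUE, full.names = TRUE)

# Name each path by its parent folder to keep track of the organism it came from.
names(dat_files) <- basename(dirname(dat_files))
print(dat_files)
```

Pointing `root` at the real data directory should be the only change needed, provided every folder holds a file ending in `.dat`.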

Any help with getting the required output is greatly appreciated. I was thinking of ways to do this in R, but I am also open to trying this in python if I get good suggestions for dealing with it there. Thanks much in advance for any help!

Link to the data - data

Recommended answer

Another approach. In this case it only reads the file you provided, but it can read multiple files.

I've added some intermediate results to show what the code is actually doing...

library(tidyverse)
library(data.table)
library(zoo)

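# collect the paths of all .dat files below the working directory (a character vector)
# note: pattern is a regular expression, not a glob, so the dot is escaped
filenames <- list.files( path = getwd(), pattern = "\\.dat$", recursive = TRUE, full.names = TRUE )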

# > filenames
#[1] "C:/Users/********/Documents/Git/udls2/test.dat"

#read in the files with data.table's fread, one line per row, and keep only the
#lines starting with UNIQUE-ID or TYPES; extend the regex pattern for other attributes
pattern <- "^UNIQUE-ID|^TYPES"
content.list <- lapply( filenames, function(x) fread( x, sep = "\n", header = FALSE )[grepl( pattern, V1 )] )

# > content.list
# [[1]]
#                        V1
# 1:  UNIQUE-ID - CPD0-1108
# 2: TYPES - D-Ribofuranose
# 3:    UNIQUE-ID - URIDINE
# 4:     TYPES - Pyrimidine

#add all content to a single data.table
dt <- rbindlist( content.list )

# > dt
#                        V1
# 1:  UNIQUE-ID - CPD0-1108
# 2: TYPES - D-Ribofuranose
# 3:    UNIQUE-ID - URIDINE
# 4:     TYPES - Pyrimidine

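#split the text into a variable name and its content
#extra = "merge" keeps values that themselves contain " - " in one piece
dt <- dt %>% separate( V1, into = c("var", "content"), sep = " - ", extra = "merge" ) %>% as.data.table()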

# > dt
#          var        content
# 1: UNIQUE-ID      CPD0-1108
# 2:     TYPES D-Ribofuranose
# 3: UNIQUE-ID        URIDINE
# 4:     TYPES     Pyrimidine

#add an increasing id for every UNIQUE-ID
dt[var == "UNIQUE-ID", id := seq_len( .N )]

# > dt
#          var        content id
# 1: UNIQUE-ID      CPD0-1108  1
# 2:     TYPES D-Ribofuranose NA
# 3: UNIQUE-ID        URIDINE  2
# 4:     TYPES     Pyrimidine NA

#fill down the id for all variables found
#na.rm = FALSE keeps the vector length unchanged even if the first row has no id
dt[, id := na.locf( id, na.rm = FALSE )]

# > dt
#          var        content id
# 1: UNIQUE-ID      CPD0-1108  1
# 2:     TYPES D-Ribofuranose  1
# 3: UNIQUE-ID        URIDINE  2
# 4:     TYPES     Pyrimidine  2

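#cast to wide format: one row per id, one column per variable
#note: attributes that repeat within a record (e.g. CHEMICAL-FORMULA) need a
#fun.aggregate, otherwise dcast falls back to counting them
dcast(dt, id ~ var, value.var = "content")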

#    id          TYPES UNIQUE-ID
# 1:  1 D-Ribofuranose CPD0-1108
# 2:  2     Pyrimidine   URIDINE
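
The example above only covers UNIQUE-ID and TYPES. The question also asks for attributes that repeat within a record (CHEMICAL-FORMULA) and for IDs buried inside DBLINKS, with NA where a record lacks them. What follows is a rough base R sketch of one way to handle those, not part of the original answer. The helper names (`get_values`, `first_or_na`, `dblink_id`, `parse_record`) are made up for illustration, the sample lines are copied from the question, and it assumes every record starts with a UNIQUE-ID line and that only the first TYPES value is wanted.

```r
# Sample lines copied from the question; a real run would use readLines() on each file.
lines <- c(
  "# File Name: compounds.dat",
  "UNIQUE-ID - CPD0-1108",
  "TYPES - D-Ribofuranose",
  "CHEMICAL-FORMULA - (C 5)",
  "CHEMICAL-FORMULA - (H 14)",
  "CHEMICAL-FORMULA - (N 1)",
  "CHEMICAL-FORMULA - (O 6)",
  "CHEMICAL-FORMULA - (P 1)",
  "DBLINKS - (CHEBI \"10647\" NIL |kothari| 3594051403 NIL NIL)",
  "DBLINKS - (BIGG \"37147\" NIL |kothari| 3584718837 NIL NIL)",
  "MOLECULAR-WEIGHT - 215.142    ",
  "//",
  "UNIQUE-ID - URIDINE",
  "TYPES - Pyrimidine"
)

# All values of one attribute within a record (character(0) if absent).
get_values <- function(rec, key) {
  prefix <- paste0(key, " - ")
  trimws(substring(rec[startsWith(rec, prefix)], nchar(prefix) + 1))
}

# First value of a vector, or NA when it is empty.
first_or_na <- function(x) if (length(x) > 0) x[1] else NA_character_

# Quoted ID for one database taken from the DBLINKS values, NA if not linked.
dblink_id <- function(links, db) {
  hit <- links[startsWith(links, paste0("(", db, " "))]
  first_or_na(sub('^[^"]*"([^"]*)".*$', "\\1", hit))
}

# Turn the lines of one record into a one-row data.frame.
parse_record <- function(rec) {
  formula <- get_values(rec, "CHEMICAL-FORMULA")
  links <- get_values(rec, "DBLINKS")
  data.frame(
    UNIQUE_ID = first_or_na(get_values(rec, "UNIQUE-ID")),
    TYPES = first_or_na(get_values(rec, "TYPES")),
    # "(C 5)", "(H 14)", ... become "C5H14..."; NA when there are no formula lines
    CHEMICAL_FORMULA = if (length(formula) > 0) {
      paste(gsub("[() ]", "", formula), collapse = "")
    } else {
      NA_character_
    },
    BIGG_ID = dblink_id(links, "BIGG"),
    CHEBI_ID = dblink_id(links, "CHEBI"),
    MOLECULAR_WEIGHT = first_or_na(get_values(rec, "MOLECULAR-WEIGHT")),
    stringsAsFactors = FALSE
  )
}

# Drop comment lines, then start a new record at every UNIQUE-ID line.
lines <- lines[!startsWith(lines, "#")]
record_id <- cumsum(startsWith(lines, "UNIQUE-ID - "))
records <- split(lines[record_id > 0], record_id[record_id > 0])
result <- do.call(rbind, lapply(records, parse_record))
print(result)
```

The remaining columns from the question (COMMON-NAME, PUBCHEM, CAS, CHEMSPIDER, MONOISOTOPIC-MW) would follow the same two patterns. For the full data set, the parsing step could be wrapped in a function and applied to each path in `filenames` with `lapply()`, and the combined table written out with `write.table(result, sep = "\t", row.names = FALSE, quote = FALSE)` to a file of your choosing.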
