Extract part of a file name in R


Question

I'm trying to write some code that opens all the data files in a folder and applies a function (or set of functions) to extract my data of interest. So far, so good. The problem is that I'd like to rename one of the columns I'm extracting from each file using one element of the file name, and I'm having a hard time figuring out how to extract it.

I have a bunch of files named "YYYY-MM-DD geneName data copy.txt" and would like to extract the "geneName" part of the file name. (For example, I have "2012-05-31 PMA1 data copy.txt".)

The date format is always the same (YYYY-MM-DD), and all the file names end in "data copy.txt".

Additionally, some of the file names have an experiment annotation (either "E(number)" or "Expt(number)") between the date and the geneName (for example, "2012-05-21 E7 PMA1 data copy.txt"); others have "SDM" between the geneName and "data copy.txt".

Here's a list of some file names and my desired output:

  • 2012-05-31 CTN1 data copy.txt (I want "CTN1")
  • 2012-05-21 E7 PMA1 data copy.txt (want "PMA1")
  • 2011-11-29 TDH3 SDM data copy.txt (want "TDH3")
  • 2012-01-04 POX1 data copy.txt (want "POX1")

Any thoughts about how I can do that without having to remove the experiment number or "SDM" from some of the file names by hand?

Thanks!

Recommended answer

The pattern here is: a date; an optional E(digits) or Expt(digits) that you don't want; a word that you do want; then an optional SDM that you don't want, followed by "data copy.txt".

Here is my test data:

> names
[1] "2012-05-31 CTN1 data copy.txt"          
[2] "2012-05-21 E7 PMA1 data copy.txt"       
[3] "2011-11-29 TDH3 SDM data copy.txt"      
[4] "2012-01-04 POX1 data copy.txt"          
[5] "2011-11-29 ECHO data copy.txt"          
[6] "2011-11-29 E8 ECHO data copy.txt"       
[7] "2011-11-29 ECHO SDM data copy.txt"      
[8] "2011-11-29 Expt2 ECHO SDM data copy.txt"

And here is my sub call:

> sub(pattern="^....-..-.. (E\\d+ |Expt\\d+ )*(\\w+) (SDM )*data copy.txt","\\2",names)
[1] "CTN1" "PMA1" "TDH3" "POX1" "ECHO" "ECHO" "ECHO" "ECHO"
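
One caveat: sub() returns its input unchanged when the pattern does not match, so an unexpected file name would pass through silently as if it were a gene name. Below is a minimal sketch of a stricter variant, assuming the same naming scheme. The helper name extract_gene is hypothetical, and the digit-only date, the escaped dot, and the $ anchor are small tightenings that are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): returns the gene name,
# or NA when a file name does not follow the expected scheme
# "YYYY-MM-DD [E<n> |Expt<n> ]gene [SDM ]data copy.txt".
extract_gene <- function(files) {
  pattern <- "^\\d{4}-\\d{2}-\\d{2} (E\\d+ |Expt\\d+ )?(\\w+) (SDM )?data copy\\.txt$"
  gene <- sub(pattern, "\\2", files)            # keep only the second capture group
  gene[!grepl(pattern, files)] <- NA_character_ # flag names that did not match
  gene
}

extract_gene(c("2012-05-21 E7 PMA1 data copy.txt",
               "2011-11-29 TDH3 SDM data copy.txt",
               "notes.txt"))
# [1] "PMA1" "TDH3" NA
```

Returning NA for non-matching names makes stray files in the folder easy to spot before any columns get renamed.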

This also works if your E-prefixes have more than one digit. I added some names starting with E to my test set to make sure they are handled properly, as well as the case of an E-prefix combined with an SDM.
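
To connect this back to the question, here is a rough sketch of how the extracted names might be used to rename a column while looping over a folder. Everything about the files themselves is an assumption made for illustration: the demo folder gene_demo, the tab-separated layout, and the column names time and value are hypothetical, since the question does not show the file contents.

```r
# Create a throwaway folder with two tiny tab-separated files so the sketch
# runs anywhere. The contents (columns "time" and "value") are invented.
data_dir <- file.path(tempdir(), "gene_demo")
dir.create(data_dir, showWarnings = FALSE)
demo_files <- c("2012-05-31 CTN1 data copy.txt",
                "2011-11-29 Expt2 ECHO SDM data copy.txt")
for (f in demo_files) {
  write.table(data.frame(time = 1:3, value = c(0.1, 0.2, 0.3)),
              file.path(data_dir, f),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# List the data files and pull the gene name out of each file name,
# using the same pattern as in the answer.
files <- list.files(data_dir, pattern = "data copy\\.txt$")
genes <- sub("^....-..-.. (E\\d+ |Expt\\d+ )*(\\w+) (SDM )*data copy.txt",
             "\\2", files)

# Read each file and rename the (assumed) "value" column to the gene name.
results <- lapply(seq_along(files), function(i) {
  d <- read.delim(file.path(data_dir, files[i]))
  names(d)[names(d) == "value"] <- genes[i]
  d
})
names(results) <- genes
```

With real data, data_dir would point at the actual folder, and the read and rename steps would be replaced by whatever extraction function is already in use.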
