How to set a for-loop in R

Problem description

I am a biologist with little programming experience. I have a series of files (in FASTA format) to which I need to apply an R package.

Each file's contents look like this:

FILE_1.FASTA

>>TTBK2_Hsap ,(CK1/TTBK)
MSGGGEQLDILSVGILVKERWKVLRKIGGGGFGEIYDALDMLTRENVALKVESAQQPKQVLKMEVAVLKKLQGKDHVCRFIGCGRNDRFNYVVMQLQGRNLADLRRSQSRGTFT

FILE_2.FASTA

>>TTBK2_Hsap ,(CK1/TTBK)
MSGGGEQLDILSVGILVKERWKVLRKIGGGGFGEIYDALDMLTRENVALKVESAQQPKQVLKMEVAVLKKLQGKDHVCRFIGCGRNDRFNYVVMQLQGRNLADLRRSQSRGTFT

The package (protr) works like this:

library(protr)
x = readFASTA(system.file('protseq/P00750.fasta', package = 'protr'))[[1]]
extractAAC(x)

Is it possible to set up a for loop around the lines above so that it reads multiple files and writes the output to a single file?

If so, please give me some ideas or an example that could help me set up a for loop in R.

Solution

This is certainly possible. A good strategy is to write a function that encapsulates what you want to do with each FASTA file:

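library(protr)  # provides readFASTA() and extractAAC()

# fasta is a string giving the path of the FASTA file to be read.
read_and_extract <- function(fasta) {
    protein_seq <- readFASTA(fasta)[[1]]  # first sequence in the file
    extractAAC(protein_seq)               # its amino acid composition
}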

This wrapper function reads a FASTA file and extracts its amino acid composition in one step. To loop over the files, first set the working directory to the folder that contains your FASTA files.

setwd("path/to/files")

The dir function (an alias for list.files) returns the names of the files in that directory.

fasta_files <- dir(pattern = "[.]fasta$")

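Note that the pattern argument restricts the listing to file names that end in ".fasta"; dir only lists the names and does not read the files. The match is case-sensitive, so add ignore.case = TRUE if your files end in ".FASTA".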
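The effect of the pattern can be checked without any real data. The following is a minimal sketch in base R; the scratch directory and the file names are made up for illustration, and the files are empty.

```r
# Create a scratch directory with a few empty files (hypothetical names).
demo_dir <- file.path(tempdir(), "fasta_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.fasta", "B.FASTA", "notes.txt", "c.fasta.bak")))

# Case-sensitive match: only names ending in lower-case ".fasta".
lower_only <- list.files(demo_dir, pattern = "[.]fasta$")

# Case-insensitive match: also picks up "B.FASTA".
any_case <- list.files(demo_dir, pattern = "[.]fasta$", ignore.case = TRUE)

# full.names = TRUE returns paths that can be read without calling setwd().
full_paths <- list.files(demo_dir, pattern = "[.]fasta$", full.names = TRUE)
```

Using full.names = TRUE is an optional design choice: it avoids changing the working directory, at the cost of longer names in the result.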

Now we perform the loop using the vapply function (see note below for details):

aa_comp <- vapply(fasta_files, read_and_extract, rep(pi, 20))

This produces a matrix with one column per FASTA file and one row per amino acid (20 rows). Now we can save it as a simple CSV file:

write.csv(aa_comp, file = "amino_acid_composition.csv")
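Since the question asks for a for loop specifically, the same workflow can also be written as an explicit loop. The sketch below runs in base R only, so it does not call protr: composition_from_file is a hypothetical stand-in for readFASTA plus extractAAC, it assumes one sequence per file, and the two example files contain made-up sequences. With protr installed, the loop body would call read_and_extract instead.

```r
# The 20 standard amino acids, in one-letter code.
amino_acids <- strsplit("ARNDCQEGHILKMFPSTWYV", "")[[1]]

# Hypothetical stand-in for readFASTA() + extractAAC(): joins all
# non-header lines of the file and returns the fraction of each amino acid.
composition_from_file <- function(path) {
    lines <- readLines(path)
    residues <- strsplit(paste(lines[!startsWith(lines, ">")], collapse = ""), "")[[1]]
    counts <- table(factor(residues, levels = amino_acids))
    setNames(as.numeric(counts) / length(residues), amino_acids)
}

# Two small example files in a scratch directory (made-up sequences).
demo_dir <- file.path(tempdir(), "loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c(">seq1", "AAAC"), file.path(demo_dir, "file_1.fasta"))
writeLines(c(">seq2", "GGKK", "KKKK"), file.path(demo_dir, "file_2.fasta"))

fasta_files <- list.files(demo_dir, pattern = "[.]fasta$", full.names = TRUE)

# Pre-allocate the result: one row per amino acid, one column per file.
aa_comp <- matrix(NA_real_,
                  nrow = length(amino_acids), ncol = length(fasta_files),
                  dimnames = list(amino_acids, basename(fasta_files)))

# Fill one column per file.
for (i in seq_along(fasta_files)) {
    aa_comp[, i] <- composition_from_file(fasta_files[i])
}

# Write all results to a single CSV file.
out_file <- file.path(demo_dir, "amino_acid_composition.csv")
write.csv(aa_comp, file = out_file)
```

Pre-allocating the matrix before the loop avoids growing an object on every iteration, which is the usual reason hand-written loops in R turn out slow.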


Details of vapply

The vapply function is a compact alternative to a for loop in R. It is usually no slower than an equivalent loop, and it checks that every result has the type and length you expect. It looks a bit confusing at first, but it works very well if you know what your output will be. Let's look at the arguments:

vapply(X, FUN, FUN.VALUE)

  • X: the vector to loop over (fasta_files)
  • FUN: the function to apply to each element of the vector (read_and_extract)
  • FUN.VALUE: a template for the expected output of each call (rep(pi, 20))

The last argument is the hardest to grasp initially, but it is simply a representative vector of our expected output. In this case, the documentation for extractAAC says that it returns a numeric vector of length 20. The command rep(pi, 20) tells R to replicate the number pi 20 times, which gives a numeric vector of length 20. Only the type and length of the template matter; the value pi is arbitrary, so numeric(20) would work equally well.
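How the template is used can be seen with a toy function, independent of protr. In this minimal sketch, range_of and samples are made up for illustration; the vapply calls themselves are standard base R.

```r
# Toy function: returns a numeric vector of length 2 for each input.
range_of <- function(x) c(min = min(x), max = max(x))
samples <- list(a = c(3.5, 1.0, 2.0), b = c(10.0, 20.0))

# numeric(2) is the template: a double vector of length 2.
# The result is a matrix with 2 rows and one column per list element.
result <- vapply(samples, range_of, numeric(2))

# The values inside the template are irrelevant; only type and length count.
same_result <- vapply(samples, range_of, rep(pi, 2))

# A template of the wrong length makes vapply stop with an error
# instead of silently returning something unexpected.
wrong_length <- tryCatch(
    vapply(samples, range_of, numeric(3)),
    error = function(e) "error"
)
```

This check is the practical difference from a plain loop: a file that yields an unexpected result fails immediately rather than corrupting the output matrix.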

The more general relatives of vapply, sapply and lapply, can return output of any type and are documented on the same help page. See help("vapply") for details.
