Extract and paste together multiple columns of a data-frame-like object using a vector of column names


Question

I have an object (variable rld) which looks a bit like a "data.frame" (see further down the post for details) in that it has columns that can be accessed using $ or [[]].

I have a vector groups containing names of some of its columns (3 in example below).

I generate strings based on combinations of elements in the columns as follows:

paste(rld[[groups[1]]], rld[[groups[2]]], rld[[groups[3]]], sep="-")

I would like to generalize this so that I don't need to know how many elements are in groups.

The following attempt fails:

> paste(rld[[groups]], collapse="-")
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, error.if.nomatch = FALSE) : 
  attempt to extract more than one element

Here is how I would do it in a functional style with a Python dictionary:

map("-".join, zip(*map(rld.get, groups)))

Is there a similar column-getter operator in R?

As suggested in the comments, here is the output of dput(rld): http://paste.ubuntu.com/23528168/ (I could not paste it directly, since it is huge.)

This was generated using the DESeq2 bioinformatics package, more precisely by doing something similar to what is described on page 28 of this document: https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf.

DESeq2 can be installed from Bioconductor as follows:

source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")

Reproducible example

One of the solutions worked when running in interactive mode, but failed when the code was put in a library function, with the following error:

Error in do.call(function(...) paste(..., sep = "-"), colData(rld)[groups]) : 
  second argument must be a list

After some tests, it appears that the problem doesn't occur if the function is in the main calling script, as follows:

library(DESeq2)
library(test.package)

lib_names <- c(
    "WT_1",
    "mut_1",
    "WT_2",
    "mut_2",
    "WT_3",
    "mut_3"
)
file_names <- paste(
    lib_names,
    "txt",
    sep="."
)

wt <- "WT"
mut <- "mut"
genotypes <- rep(c(wt, mut), times=3)
replicates <- c(rep("1", times=2), rep("2", times=2), rep("3", times=2))

sample_table = data.frame(
    lib = lib_names,
    file_name = file_names,
    genotype = genotypes,
    replicate = replicates
)

dds_raw <- DESeqDataSetFromHTSeqCount(
    sampleTable = sample_table,
    directory = ".",
    design = ~ genotype
    )

# Remove genes with too few read counts
dds <- dds_raw[ rowSums(counts(dds_raw)) > 1, ]
dds$group <- factor(dds$genotype)
design(dds) <- ~ replicate + group
dds <- DESeq(dds)

test_do_paste <- function(dds) {
    require(DESeq2)
    groups <- head(colnames(colData(dds)), -2)
    rld <- rlog(dds, blind=FALSE)
    stopifnot(all(groups %in% names(colData(rld))))
    combined_names <- do.call(
        function (...) paste(..., sep = "-"),
        colData(rld)[groups]
    )
    print(combined_names)
}

test_do_paste(dds)
# This fails (with the same function put in a package)
#test.package::test_do_paste(dds)

Data used in the example:

WT_2.txt

WT_3.txt

mut_1.txt

mut_2.txt

mut_3.txt

I posted this issue as a separate question: do.call error "second argument must be a list" with S4Vectors when the code is in a library

Although I have an answer to my initial question, I'm still interested in alternative solutions for the "column extraction using a vector of column names" issue.

Recommended answer

We may use either of the following:

do.call(function (...) paste(..., sep = "-"), rld[groups])
do.call(paste, c(rld[groups], sep = "-"))

We can consider a small, reproducible example:

rld <- mtcars[1:5, ]
groups <- names(mtcars)[c(1,3,5,6,8)]
do.call(paste, c(rld[groups], sep = "-"))
#[1] "21-160-3.9-2.62-0"     "21-160-3.9-2.875-0"    "22.8-108-3.85-2.32-1" 
#[4] "21.4-258-3.08-3.215-1" "18.7-360-3.15-3.44-0"
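
Since the question also asks for alternatives, here is a minimal sketch that avoids do.call altogether by folding paste over the selected columns with Reduce. It reuses the rld and groups values from the small example above and is only an illustration, not a method taken from the original answer.

```r
# Same small example data as above
rld <- mtcars[1:5, ]
groups <- names(mtcars)[c(1, 3, 5, 6, 8)]

# Reduce() treats the data frame as a list of columns and pastes them
# together two at a time, from left to right
combined <- Reduce(function(a, b) paste(a, b, sep = "-"), rld[groups])
print(combined)
```

This gives the same strings as the do.call version for this data. The do.call form calls paste once, whereas Reduce calls it once per extra column, which only matters for a very large number of columns.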

Note that it is your responsibility to ensure that all(groups %in% names(rld)) is TRUE; otherwise you get a "subscript out of bounds" or "undefined columns selected" error.
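
The check mentioned above can be made explicit before subsetting. The sketch below is one possible way to do it: the column name not_a_column is made up for illustration, and dropping missing names (rather than stopping) is just one design choice.

```r
rld <- mtcars[1:5, ]
groups <- c("mpg", "disp", "not_a_column")  # last name is deliberately wrong

# Report exactly which requested columns are absent
missing_cols <- setdiff(groups, names(rld))
if (length(missing_cols) > 0) {
  message("Missing columns: ", paste(missing_cols, collapse = ", "))
}

# Paste only the columns that are actually present
valid_groups <- intersect(groups, names(rld))
combined <- do.call(paste, c(rld[valid_groups], sep = "-"))
print(combined)
```

If a missing column should be treated as a hard error instead, replacing the message() call with stop() is the stricter alternative.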

(I am copying in your comments as a follow-up.)

It seems the methods you propose don't work directly on my object. However, the package I'm using provides a colData function that makes something more similar to a data.frame:

> class(colData(rld))
[1] "DataFrame"
attr(,"package")
[1] "S4Vectors"

do.call(function (...) paste(..., sep = "-"), colData(rld)[groups]) works, but do.call(paste, c(colData(rld)[groups], sep = "-")) fails with an error message I fail to understand (as too often with R...):

> do.call(paste, c(colData(rld)[groups], sep = "-"))
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘mcols’ for signature ‘"character"’
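
One plausible reading of this error is that c() applied to an S4Vectors DataFrame does not build a plain list, so paste never receives ordinary list arguments. A possible workaround, sketched below, is to coerce the selected columns to a base list first. The helper name paste_columns is hypothetical, the sketch assumes that as.list() on a DataFrame returns a plain list of its columns, and it is demonstrated here on an ordinary data.frame only, not on a DESeq2 object.

```r
# Hypothetical helper: paste the named columns of a data-frame-like object.
# as.list() converts the selected columns to a plain list before c() and
# do.call() see them, so both operate on a base R list.
paste_columns <- function(x, cols, sep = "-") {
  stopifnot(all(cols %in% names(x)))
  # unname() so column names are not passed to paste() as argument names
  # (a column called "sep" or "collapse" would otherwise clash)
  cols_list <- unname(as.list(x[cols]))
  do.call(paste, c(cols_list, sep = sep))
}

# Demonstration on a plain data.frame with made-up sample annotations
df <- data.frame(genotype = c("WT", "mut"), replicate = c("1", "2"))
combined <- paste_columns(df, c("genotype", "replicate"))
print(combined)
```

Under the same assumption, the call for the object in the question would be paste_columns(colData(rld), groups). Because do.call then receives a base list, this may also avoid the "second argument must be a list" error seen inside a package, though that has not been verified here.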
