Pairwise Comparison of Rows in R

Question

I have a dataset that contains results for many tests across many samples. The samples are replicated within the dataset. I would like to compare the test results between replicates within each group of replicated samples. I thought it might be easiest to first split my data frame by SampleID so that I have a list of data frames, one for each SampleID. There could be 2, 3, 4, or even 5 replicates of a sample, so the number of unique row combinations to compare differs between sample groups. I have laid out the logic I have in mind below. I want to run a function on the list of data frames and output the match results. The function would compare each unique pair of rows within a group of replicated samples and return "Match", "Mismatch", or NA for each test (NA if one or both values for that test are missing). It would also return the number of tests that overlapped between the two compared replicates, the number of matches, and the number of mismatches. Lastly, it would include a column where the sample names are pasted together with their row numbers so I know which two samples were compared (e.g. Sample1.1_Sample1.2). Could anyone point me in the right direction?

    #Input data structure
    data = as.data.frame(cbind(rbind("Sample1","Sample1","Sample2","Sample2","Sample2"),rbind("A","A","C","C","C"), rbind("A","T","C","C","C"), 
                 rbind("A",NA,"C","C","C"), rbind("A","A","C","C","C"), rbind("A","T","C","C",NA), rbind("A","A","C","C","C"),
                 rbind("A","A","C","C","C"), rbind("A",NA,"C","T","T"), rbind("A","A","C","C","C"), rbind("A","A","C","C","C")))

    colnames(data) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10")
    data 

    data.split = split(data, data$SampleID)


    ##Row comparison function
    #Input is a list of data frames. Each data frame contains results for replicates of the same sample.
    RowCompare = function(x){
      rowcount = nrow(x)
      ##ifelse(rowcount==2,
        ##compare row 1 to row 2
          ##paste sample names being compared together
          ##how many non-NA values overlap, keep value
          ##of those that overlap, how many match, keep value
          ##of those that overlap, how many do not match, keep value
      #ifelse(rowcount==3,
          ##compare row 1 to row 2
            ##paste sample names being compared together
            ##how many non-NA values overlap, keep value
            ##of those that overlap, how many match, keep value
            ##of those that overlap, how many do not match, keep value
          ##compare row 1 to row 3
            ##paste sample names being compared together
            ##how many non-NA values overlap, keep value
            ##of those that overlap, how many match, keep value
            ##of those that overlap, how many do not match, keep value
          ##compare row 2 to row 3
            ##paste sample names being compared together
            ##how many non-NA values overlap, keep value
            ##of those that overlap, how many match, keep value
            ##of those that overlap, how many do not match, keep value
      return(results)
    }

    #Output is a list of data frames - one for sample name
    out = lapply(names(data.split), function(x) RowCompare(data.split[[x]])) 

    #Row bind the list of data frames back together to one large data frame
    out.merge = do.call(rbind.data.frame, out) 
    head(out.merge)

    #Desired output
    out.merge = as.data.frame(cbind(rbind("Sample1.1_Sample1.2","Sample2.1_Sample2.2","Sample2.1_Sample2.3","Sample2.2_Sample2.3"),rbind("Match","Match","Match","Match"), 
                      rbind("Mismatch","Match","Match","Match"), rbind(NA,"Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind("Mismatch","Match",NA,NA), 
                      rbind("Match","Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind(NA,"Mismatch","Mismatch","Match"), rbind("Match","Match","Match","Match"), 
                      rbind("Match","Match","Match","Match"), rbind(8,10,9,9), rbind(6,9,8,8), rbind(2,1,1,1)))

    colnames(out.merge) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10", "Num_Overlap", "Num_Match","Num_Mismatch")
    out.merge

One thing I did see on another post that I thought might be useful is the line below which would create a data frame of unique row combinations that could then be used to define which rows to compare in each group of replicated samples. Not sure how to implement it though.

    t(combn(nrow(data),2))

Thanks.

Recommended answer

You are on the right track with t(combn(nrow(data),2)).
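
As a quick illustration of what that call returns (a sketch added for clarity, not part of the original answer), consider a group with three replicates. The group size of 3 is only an example value; in practice the call would be made on the row count of each split group rather than on nrow(data).

```r
# A group with three replicates has three unique row pairs.
# combn(3, 2) returns the pairs as columns; t() turns them into rows.
pairs <- t(combn(3, 2))
pairs
#      [,1] [,2]
# [1,]    1    2
# [2,]    1    3
# [3,]    2    3
```

Each row of the matrix names the two rows of the group to compare, so looping over its rows visits every pair exactly once.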

    # Column positions of the test results (Test1 ... Test10)
    testCols <- grep("^Test\\d+", colnames(data))

    # Compare two one-row data frames of test results
    TestsCompare <- function(x, y) {
      ## how many non-NA values overlap
      overlaps <- sum(!is.na(x) & !is.na(y))
      ## of those that overlap, how many match
      matches <- sum(x == y, na.rm = TRUE)
      ## of those that overlap, how many do not match
      non_matches <- overlaps - matches # complement of matches
      c(overlaps, matches, non_matches)
    }

    # Compare every unique pair of rows within one sample group
    RowCompare <- function(x) {
      comp <- NULL
      pairs <- t(combn(nrow(x), 2))
      for (i in seq_len(nrow(pairs))) {
        row_a <- pairs[i, 1]
        row_b <- pairs[i, 2]
        a_tests <- x[row_a, testCols]
        b_tests <- x[row_b, testCols]
        comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
      }
      colnames(comp) <- c("row_a", "row_b", "overlaps", "matches", "non_matches")
      return(comp)
    }

    out <- lapply(data.split, RowCompare)

This produces:

    > out
    $Sample1
         row_a row_b overlaps matches non_matches
    [1,]     1     2        8       6           2

    $Sample2
         row_a row_b overlaps matches non_matches
    [1,]     1     2       10       9           1
    [2,]     1     3        9       8           1
    [3,]     2     3        9       9           0
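
The answer returns only the counts. The question also asked for a per-test "Match"/"Mismatch"/NA label and a pasted pair name, so the following is a minimal sketch of one way to add them. It is not part of the original answer: the helper name label_pairs and the variable test_cols are hypothetical, and the example data is rebuilt with data.frame() so the block runs on its own. The guard for groups with fewer than two rows is an added assumption, since combn() raises an error in that case.

```r
# Rebuild the question's example data (character columns, NA = missing result)
data <- data.frame(
  SampleID = c("Sample1", "Sample1", "Sample2", "Sample2", "Sample2"),
  Test1  = c("A", "A", "C", "C", "C"),
  Test2  = c("A", "T", "C", "C", "C"),
  Test3  = c("A", NA,  "C", "C", "C"),
  Test4  = c("A", "A", "C", "C", "C"),
  Test5  = c("A", "T", "C", "C", NA),
  Test6  = c("A", "A", "C", "C", "C"),
  Test7  = c("A", "A", "C", "C", "C"),
  Test8  = c("A", NA,  "C", "T", "T"),
  Test9  = c("A", "A", "C", "C", "C"),
  Test10 = c("A", "A", "C", "C", "C"),
  stringsAsFactors = FALSE
)
test_cols <- grep("^Test\\d+$", colnames(data), value = TRUE)

# Hypothetical helper: label every unique pair of rows within one sample group
label_pairs <- function(x) {
  if (nrow(x) < 2) return(NULL)            # nothing to compare; combn() would error
  pairs <- t(combn(nrow(x), 2))
  rows <- lapply(seq_len(nrow(pairs)), function(i) {
    a <- unlist(x[pairs[i, 1], test_cols])
    b <- unlist(x[pairs[i, 2], test_cols])
    same <- a == b                         # NA wherever either value is missing
    labels <- ifelse(same, "Match", "Mismatch")
    data.frame(
      SampleID = paste0(x$SampleID[1], ".", pairs[i, 1], "_",
                        x$SampleID[1], ".", pairs[i, 2]),
      as.list(labels),                     # one column per test
      Num_Overlap  = sum(!is.na(same)),
      Num_Match    = sum(same, na.rm = TRUE),
      Num_Mismatch = sum(!same, na.rm = TRUE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

out.merge <- do.call(rbind, lapply(split(data, data$SampleID), label_pairs))
rownames(out.merge) <- NULL
out.merge
```

With the example data this yields four rows, one per pair, and the three count columns agree with the matrices shown above (for instance, Sample2.2_Sample2.3 has 9 overlaps, 9 matches, and 0 mismatches). The row numbers in the pasted name are positions within each group, not positions in the full data frame.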
