How to compare two data frames/tables and extract data in R?


Problem description

In an attempt to extract the mismatches between the two data frames below, I have already managed to create a new data frame in which the mismatches are replaced.
What I need now is a list of the mismatches:

dfA <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "CA"), animal3 = c("AA", "TT", "AG", "CA")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfA
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      TT
# snp3      AG      AG      AG
# snp4      CA      CA      CA
dfB <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "DF"), animal3 = c("AA", "TB", "AG", "DF")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfB
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      TB
# snp3      AG      AG      AG
# snp4      CA      DF      DF

To clarify the mismatches, here they are marked as 00's:

#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      00
# snp3      AG      AG      AG
# snp4      CA      00      00
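
For reference, a marked table like the one above can be produced without lapply by comparing the two tables element-wise. This is only a minimal sketch in base R: it assumes both tables have identical row and column names in the same order, and the names mismatch and marked are made up for illustration.

```r
# Same data as in the question, rebuilt as plain data frames.
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)

# Logical matrix: TRUE wherever the two tables disagree.
mismatch <- as.matrix(dfA) != as.matrix(dfB)

# Copy of dfA (as a character matrix) with the mismatching cells set to "00".
marked <- as.matrix(dfA)
marked[mismatch] <- "00"
print(marked)
```

The logical matrix mismatch is the useful intermediate here: it records where the tables differ, which is the information the requested list of mismatches is built from.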

I need the following output:

structure(list(snpname = structure(c(1L, 2L, 2L), .Label = c("snp2", "snp4"), class = "factor"), animalname = structure(c(2L, 1L, 2L), .Label = c("animal2", "animal3"), class = "factor"), alleledfA = structure(c(2L, 1L, 1L), .Label = c("CA", "TT"), class = "factor"), alleledfB = structure(c(2L, 1L, 1L), .Label = c("DF", "TB"), class = "factor")), .Names = c("snpname", "animalname", "alleledfA", "alleledfB"), class = "data.frame", row.names = c(NA, -3L))
#  snpname animalname alleledfA alleledfB
#1    snp2    animal3        TT        TB
#2    snp4    animal2        CA        DF
#3    snp4    animal3        CA        DF

So far I have been trying to extract the additional data from the lapply function that I use to replace the mismatches with zeros, but without success. I also tried to write an ifelse function, again without success. I hope you can help me out here!

Eventually this will be run on data sets with dimensions of 100K by 1000, so efficiency is a plus.

Solution

This question has the data.table tag, so here is my attempt using that package. The first step is to convert the row names to a column, since data.table does not keep row names. Next, bind the two tables with rbind while setting an id per data set, and melt the result to long format. Then keep only the (rn, variable) groups that contain more than one unique value, and finally cast back to wide format.

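library(data.table)
setDT(dfA, keep.rownames = TRUE)  # row names become the column "rn"
setDT(dfB, keep.rownames = TRUE)

# Stack both tables (.id = 1 for dfA, 2 for dfB), melt to long format,
# keep the (rn, variable) groups whose values differ, then cast back to wide.
dcast(melt(rbind(dfA,
                 dfB,
                 idcol = TRUE),
           id.vars = 1:2
           )[,
             if (uniqueN(value) > 1L) .SD,
             by = .(rn, variable)],
      rn + variable ~ .id)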

#      rn variable  1  2
# 1: snp2  animal3 TT TB
# 2: snp4  animal2 CA DF
# 3: snp4  animal3 CA DF
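
As a possible alternative that avoids reshaping, the same list can be sketched in base R with which(..., arr.ind = TRUE) on the element-wise comparison. This is not part of the answer above, and no timing comparison is claimed. It assumes both tables share identical row and column names in the same order. The data frames are rebuilt here as plain data frames (without the rn column that setDT adds), and the result name mismatches is made up for illustration.

```r
# Same data as in the question, rebuilt as plain data frames.
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)

mA <- as.matrix(dfA)
mB <- as.matrix(dfB)

# Assumption: both tables are aligned cell by cell.
stopifnot(identical(dimnames(mA), dimnames(mB)))

# Row/column positions of every cell where the two tables differ.
idx <- which(mA != mB, arr.ind = TRUE)

# which() walks the matrix column by column; reorder by SNP, then animal.
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

# Look up the names and the two alleles for each mismatching cell.
mismatches <- data.frame(snpname    = rownames(mA)[idx[, "row"]],
                         animalname = colnames(mA)[idx[, "col"]],
                         alleledfA  = mA[idx],
                         alleledfB  = mB[idx],
                         row.names  = NULL,
                         stringsAsFactors = FALSE)
print(mismatches)
#   snpname animalname alleledfA alleledfB
# 1    snp2    animal3        TT        TB
# 2    snp4    animal2        CA        DF
# 3    snp4    animal3        CA        DF
```

The columns are named as in the requested output, but they are character vectors, not factors. Wrap them in factor() if factors are needed.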
