How to compare two data frames/tables and extract data in R?
Problem description
In an attempt to extract the mismatches between the two data frames below, I have already managed to create a new data frame in which the mismatches are replaced.
What I need now is a list of the mismatches:
dfA <- structure(list(animal1 = c("AA", "TT", "AG", "CA"),
                      animal2 = c("AA", "TB", "AG", "CA"),
                      animal3 = c("AA", "TT", "AG", "CA")),
                 .Names = c("animal1", "animal2", "animal3"),
                 row.names = c("snp1", "snp2", "snp3", "snp4"),
                 class = "data.frame")
# > dfA
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      TT
# snp3      AG      AG      AG
# snp4      CA      CA      CA
dfB <- structure(list(animal1 = c("AA", "TT", "AG", "CA"),
                      animal2 = c("AA", "TB", "AG", "DF"),
                      animal3 = c("AA", "TB", "AG", "DF")),
                 .Names = c("animal1", "animal2", "animal3"),
                 row.names = c("snp1", "snp2", "snp3", "snp4"),
                 class = "data.frame")
# > dfB
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      TB
# snp3      AG      AG      AG
# snp4      CA      DF      DF
To clarify the mismatches, here they are marked as 00's:
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      00
# snp3      AG      AG      AG
# snp4      CA      00      00
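For reference, a marked table like the one above can be rebuilt in a few lines of base R. This is only a minimal sketch and not necessarily how the asker produced it: it assumes dfA is the table the 00's are written into (the question does not say which one was used), and the names mismatch and marked are made up for illustration.

```r
# Same data as in the question, rebuilt with data.frame() so the block runs on its own
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)

# Element-wise comparison: a logical matrix that is TRUE wherever the alleles differ
mismatch <- dfA != dfB

# Assumption: start from a copy of dfA and overwrite the mismatching cells
marked <- dfA
marked[mismatch] <- "00"
print(marked)
```

The logical matrix mismatch already holds the positions that the requested list of mismatches has to report; the remaining work is turning those positions into rows of SNP name, animal name and the two alleles.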
I need the following output:
structure(list(snpname = structure(c(1L, 2L, 2L), .Label = c("snp2", "snp4"), class = "factor"),
               animalname = structure(c(2L, 1L, 2L), .Label = c("animal2", "animal3"), class = "factor"),
               alleledfA = structure(c(2L, 1L, 1L), .Label = c("CA", "TT"), class = "factor"),
               alleledfB = structure(c(2L, 1L, 1L), .Label = c("DF", "TB"), class = "factor")),
          .Names = c("snpname", "animalname", "alleledfA", "alleledfB"),
          class = "data.frame",
          row.names = c(NA, -3L))
#   snpname animalname alleledfA alleledfB
# 1    snp2    animal3        TT        TB
# 2    snp4    animal2        CA        DF
# 3    snp4    animal3        CA        DF
So far I've been trying to extract additional data out of the lapply function that I use to replace the mismatches with zeros, but without success. I also tried to write an ifelse function, again without success. Hope you guys can help me out here!
Eventually this will be run on data sets with dimensions of 100K by 1000, so efficiency is a plus.
Solution
This question has the data.table tag, so here's my attempt using this package. The first step is to convert the row names to a column, as data.table doesn't like row names. Then rbind the two data sets with an id per data set, convert to long format, find where there is more than one unique value, and convert back to a wide format.
library(data.table)
# setDT converts in place; keep.rownames = TRUE stores the row names in a column named rn
setDT(dfA, keep.rownames = TRUE)
setDT(dfB, keep.rownames = TRUE)
# idcol = TRUE adds a .id column (1 = dfA, 2 = dfB); columns 1:2 (.id and rn) are the id variables
dcast(melt(rbind(dfA, dfB, idcol = TRUE),
           id.vars = 1:2
           )[,
             if (uniqueN(value) > 1L) .SD,  # keep only groups where the two alleles differ
             by = .(rn, variable)],
      rn + variable ~ .id)
#      rn variable  1  2
# 1: snp2  animal3 TT TB
# 2: snp4  animal2 CA DF
# 3: snp4  animal3 CA DF
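If the one-liner is hard to follow, the same steps can be unpacked into named intermediates. This is just a sketch under a few assumptions and not a replacement for the answer: as.data.table() and rbindlist() are swapped in for setDT() and rbind() so the original data frames are left untouched, and the names dtA, dtB, stacked, long, diffs and res are invented for illustration. The final setnames() call, which renames the columns to match the layout asked for in the question, is also an addition.

```r
library(data.table)

# Same data as in the question, rebuilt so the block runs on its own
dfA <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "CA"),
                  animal3 = c("AA", "TT", "AG", "CA"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(animal1 = c("AA", "TT", "AG", "CA"),
                  animal2 = c("AA", "TB", "AG", "DF"),
                  animal3 = c("AA", "TB", "AG", "DF"),
                  row.names = c("snp1", "snp2", "snp3", "snp4"),
                  stringsAsFactors = FALSE)

# Step 1: copies with the row names moved into a column named "rn"
dtA <- as.data.table(dfA, keep.rownames = TRUE)
dtB <- as.data.table(dfB, keep.rownames = TRUE)

# Step 2: stack both tables; idcol = TRUE adds ".id" (1 for dtA, 2 for dtB)
stacked <- rbindlist(list(dtA, dtB), idcol = TRUE)

# Step 3: long format, one row per (.id, rn, variable) with the allele in "value"
long <- melt(stacked, id.vars = c(".id", "rn"))

# Step 4: keep only the (rn, variable) groups whose alleles differ between the tables
diffs <- long[, if (uniqueN(value) > 1L) .SD, by = .(rn, variable)]

# Step 5: back to wide format, one allele column per source table
res <- dcast(diffs, rn + variable ~ .id, value.var = "value")

# Assumption: rename all four columns to the names used in the question's desired output
setnames(res, c("snpname", "animalname", "alleledfA", "alleledfB"))
print(res)
```

Splitting it up this way also makes it easier to check the intermediate sizes: for a 100K by 1000 input, long has one row per cell of each table, so memory use at step 3 is worth watching.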