How to compare two data frames/tables and extract data in R?
Problem description
In an attempt to extract the mismatches between the two data frames below, I've already managed to create a new data frame in which the mismatches are replaced.
What I need now is a list of the mismatches:
dfA <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "CA"), animal3 = c("AA", "TT", "AG", "CA")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfA
# animal1 animal2 animal3
# snp1 AA AA AA
# snp2 TT TB TT
# snp3 AG AG AG
# snp4 CA CA CA
dfB <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "DF"), animal3 = c("AA", "TB", "AG", "DF")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfB
# animal1 animal2 animal3
# snp1 AA AA AA
# snp2 TT TB TB
# snp3 AG AG AG
# snp4 CA DF DF
To clarify, here are the mismatches marked as 00:
# animal1 animal2 animal3
# snp1 AA AA AA
# snp2 TT TB 00
# snp3 AG AG AG
# snp4 CA 00 00
I need the following output:
structure(list(snpname = structure(c(1L, 2L, 2L), .Label = c("snp2", "snp4"), class = "factor"), animalname = structure(c(2L, 1L, 2L), .Label = c("animal2", "animal3"), class = "factor"), alleledfA = structure(c(2L, 1L, 1L), .Label = c("CA", "TT"), class = "factor"), alleledfB = structure(c(2L, 1L, 1L), .Label = c("DF", "TB"), class = "factor")), .Names = c("snpname", "animalname", "alleledfA", "alleledfB"), class = "data.frame", row.names = c(NA, -3L))
# snpname animalname alleledfA alleledfB
#1 snp2 animal3 TT TB
#2 snp4 animal2 CA DF
#3 snp4 animal3 CA DF
So far I've been trying to extract additional data from the lapply function I use to replace the mismatches with zeros, but without success. I also tried to write an ifelse function, again without success. Hope you guys can help me out here!
Eventually this will be run on data sets with dimensions of 100K by 1000, so efficiency is a plus.
Recommended answer
This question has the data.table tag, so here's my attempt using this package. The first step is to convert the row names to a column, as data.table doesn't like those. Then rbind the two data sets with an id per data set, convert to long format, keep the groups that have more than one unique value, and convert back to wide format:
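library(data.table)

# Convert to data.table by reference; row names go into a column named "rn"
setDT(dfA, keep.rownames = TRUE)
setDT(dfB, keep.rownames = TRUE)

# Stack both tables (.id marks the source table), melt to long format,
# keep the (rn, variable) groups whose values differ, then cast back to wide
dcast(
  melt(rbind(dfA, dfB, idcol = TRUE), id.vars = 1:2)[
    , if (uniqueN(value) > 1L) .SD, by = .(rn, variable)],
  rn + variable ~ .id
)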
# rn variable 1 2
# 1: snp2 animal3 TT TB
# 2: snp4 animal2 CA DF
# 3: snp4 animal3 CA DF
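For comparison, the same list can be sketched in base R by comparing the two tables as character matrices and using which() with arr.ind = TRUE to locate the cells that differ. This is an alternative to the data.table answer above, not part of it. It assumes both tables have identical dimensions and row/column order, contain no missing values (cells where the comparison is NA are dropped by which()), and that it starts from the original data frames as defined in the question, before setDT() is applied. The name mismatches is made up for illustration, and whether this is faster at 100K by 1000 would need benchmarking.

```r
# Example data, rebuilt from the question (plain data frames with row names)
dfA <- data.frame(
  animal1 = c("AA", "TT", "AG", "CA"),
  animal2 = c("AA", "TB", "AG", "CA"),
  animal3 = c("AA", "TT", "AG", "CA"),
  row.names = c("snp1", "snp2", "snp3", "snp4"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  animal1 = c("AA", "TT", "AG", "CA"),
  animal2 = c("AA", "TB", "AG", "DF"),
  animal3 = c("AA", "TB", "AG", "DF"),
  row.names = c("snp1", "snp2", "snp3", "snp4"),
  stringsAsFactors = FALSE
)

# Character matrices allow a single vectorised element-wise comparison
mA <- as.matrix(dfA)
mB <- as.matrix(dfB)

# Row/column positions of every cell that differs between the two tables
idx <- which(mA != mB, arr.ind = TRUE)

# Assemble the mismatch list; indexing a matrix with a two-column
# index matrix picks out exactly the differing cells
mismatches <- data.frame(
  snpname    = rownames(mA)[idx[, "row"]],
  animalname = colnames(mA)[idx[, "col"]],
  alleledfA  = mA[idx],
  alleledfB  = mB[idx],
  stringsAsFactors = FALSE
)

# Optional: sort (lexically) by SNP, then animal, to match the requested layout
mismatches <- mismatches[order(mismatches$snpname, mismatches$animalname), ]
rownames(mismatches) <- NULL
mismatches
```

The as.matrix() step copies the data, so on very large inputs memory use is worth checking before relying on this approach.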