High-level data matching between 2 tables
Problem description
I'm new to R and I need advice on dealing with this problem:
I have 2 tables. The start of each table is shown below:
Table 1:
SNP Gene Pval Best_SNP Best_Pval
rs2932538 ENSG00000007341 5.6007
rs10488631 ENSG00000064419 7.7461
rs12537284 ENSG00000064419 4.5544
rs3764650 ENSG00000064666 12.3401
rs10479002 ENSG00000072682 5.0141
rs6704644 ENSG00000072682 6.2306
rs2900211 ENSG00000072682 9.9022
Table 2:
Best_SNP Gene Best_Pval
rs9028922 ENSG00000007341 10.7892
rs8233293 ENSG00000064666 89.342
rs3234432 ENSG00000072682 32.321
rs2892334 ENSG00000064419 43.235
Table 1 contains the entire list of SNPs for each gene. Table 2 contains the best SNP and the corresponding best Pval for each gene that appears in Table 1.
I want to do the following: match each Gene in Table 1 to Table 2, then copy the Best_SNP and Best_Pval from Table 2 into the Best_SNP and Best_Pval columns of Table 1 for that Gene. The tricky part is that in Table 1 each gene is repeated over a varying number of rows. For example, the second gene, ENSG00000064419, spans 2 rows and ENSG00000072682 spans 3 rows. So the code needs to go through the gene names and copy the Best_SNP and Best_Pval only once per gene.
So for gene ENSG00000072682, out of the 3 rows, only the first row in which the gene appears needs to have the Best_SNP and Best_Pval columns filled in. I don't want the other 2 repeated rows to have Best_SNP and Best_Pval filled in as well. That way it is easier to see where each gene starts and ends.
Recommended answer
If I understand the question correctly, this is the solution:
x <- structure(list(SNP = c("rs2932538", "rs10488631", "rs12537284", "rs3764650",
"rs10479002", "rs6704644", "rs2900211"), Gene = c("ENSG00000007341", "ENSG00000064419",
"ENSG00000064419", "ENSG00000064666", "ENSG00000072682", "ENSG00000072682","ENSG00000072682"),
Pval= c(5.6007, 7.7461, 4.5544, 12.3401, 5.0141, 6.2306, 9.9022)), row.names= c(NA, 7L), class = "data.frame")
x
SNP Gene Pval
1 rs2932538 ENSG00000007341 5.6007
2 rs10488631 ENSG00000064419 7.7461
3 rs12537284 ENSG00000064419 4.5544
4 rs3764650 ENSG00000064666 12.3401
5 rs10479002 ENSG00000072682 5.0141
6 rs6704644 ENSG00000072682 6.2306
7 rs2900211 ENSG00000072682 9.9022
# Keep only the first row for each gene
x1 <- x[!duplicated(x$Gene), ]
x1
SNP Gene Pval
1 rs2932538 ENSG00000007341 5.6007
2 rs10488631 ENSG00000064419 7.7461
4 rs3764650 ENSG00000064666 12.3401
5 rs10479002 ENSG00000072682 5.0141
y <- structure(list(Best_SNP = c("rs9028922", "rs8233293", "rs3234432", "rs2892334"), Gene = c("ENSG00000007341", "ENSG00000064666",
"ENSG00000072682", "ENSG00000064419" ),
Best_Pval= c(10.7892, 89.342, 32.321, 43.235)), row.names= c(NA, 4L), class = "data.frame")
y
Best_SNP Gene Best_Pval
1 rs9028922 ENSG00000007341 10.7892
2 rs8233293 ENSG00000064666 89.3420
3 rs3234432 ENSG00000072682 32.3210
4 rs2892334 ENSG00000064419 43.2350
merge(x1, y, by="Gene", all= FALSE)
Gene SNP Pval Best_SNP Best_Pval
1 ENSG00000007341 rs2932538 5.6007 rs9028922 10.7892
2 ENSG00000064419 rs10488631 7.7461 rs2892334 43.2350
3 ENSG00000064666 rs3764650 12.3401 rs8233293 89.3420
4 ENSG00000072682 rs10479002 5.0141 rs3234432 32.3210
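The merge above returns one row per gene, so the other SNP rows from Table 1 are dropped. If every row has to be kept, with Best_SNP and Best_Pval filled in only on the first row of each gene as the question describes, a minimal base R sketch using match() and duplicated() might look like the following. The names first_row and idx are introduced here purely for illustration, and leaving the remaining rows as NA is an assumption about what "not filled in" should mean.

```r
# Table 1 (all SNPs per gene) and Table 2 (best SNP per gene), as in the question
x <- data.frame(
  SNP  = c("rs2932538", "rs10488631", "rs12537284", "rs3764650",
           "rs10479002", "rs6704644", "rs2900211"),
  Gene = c("ENSG00000007341", "ENSG00000064419", "ENSG00000064419",
           "ENSG00000064666", "ENSG00000072682", "ENSG00000072682",
           "ENSG00000072682"),
  Pval = c(5.6007, 7.7461, 4.5544, 12.3401, 5.0141, 6.2306, 9.9022),
  stringsAsFactors = FALSE
)
y <- data.frame(
  Best_SNP  = c("rs9028922", "rs8233293", "rs3234432", "rs2892334"),
  Gene      = c("ENSG00000007341", "ENSG00000064666",
                "ENSG00000072682", "ENSG00000064419"),
  Best_Pval = c(10.7892, 89.342, 32.321, 43.235),
  stringsAsFactors = FALSE
)

# TRUE only for the first row in which each gene appears (illustrative name)
first_row <- !duplicated(x$Gene)

# Row position in y of each gene from x; NA if a gene is missing from y
idx <- match(x$Gene, y$Gene)

# Start with empty columns (assumption: NA marks "not filled in"),
# then fill in the first row of each gene only
x$Best_SNP  <- NA_character_
x$Best_Pval <- NA_real_
x$Best_SNP[first_row]  <- y$Best_SNP[idx[first_row]]
x$Best_Pval[first_row] <- y$Best_Pval[idx[first_row]]

x
```

With the sample data, all 7 rows are kept in their original order, and rows 3, 6 and 7 (the repeated genes) keep NA in both Best columns. Unlike merge(), this approach does not reorder the rows by Gene.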