How to merge two data frames based on a partial string match in R?
Question
I have two data frames.
The first one contains a large number of proteins for which I have made several calculations. Here is an example:
>Accession Description # Peptides A2 # PSM A2 # Peptides B2 # PSM B2 # Peptides C2 # PSM C2 # Peptides D2 # PSM D2 # Peptides E2 # PSM E2 # AAs MW [kDa] calc. pI
P01837 Ig kappa chain C region OS=Mus musculus PE=1 SV=1 - [IGKC_MOUSE] 10 319 8 128 8 116 7 114 106 11,8 5,41
P01868 Ig gamma-1 chain C region secreted form OS=Mus musculus GN=Ighg1 PE=1 SV=1 - [IGHG1_MOUSE] 13 251 15 122 16 116 16 108 324 35,7 7,40
P60710 Actin, cytoplasmic 1 OS=Mus musculus GN=Actb PE=1 SV=1 - [ACTB_MOUSE] 15 215 10 37 11 30 11 31 16 154 375 41,7 5,48
The second contains the proteins of interest. Here is an example:
>complex Description Accession protein
TFIID [TAF1_MOUSE] Q80UV9-3 Isoform 3 of Transcription initiation factor TFIID subunit 1 OS=Mus musculus GN=Taf1 - [TAF1_MOUSE]
TFIID [TAF2_MOUSE] Q8C176 Transcription initiation factor TFIID subunit 2 OS=Mus musculus GN=Taf2 PE=2 SV=2 - [TAF2_MOUSE]
TFIID [TAF3_MOUSE] Q5HZG4 Transcription initiation factor TFIID subunit 3 OS=Mus musculus GN=Taf3 PE=1 SV=2 - [TAF3_MOUSE]
What I want to do: get a single data frame containing the values from my calculations for the proteins of interest only. In a first attempt I used:
fusion <- merge.data.frame(x=tableaucleanIPTAFXwoNA, y=sublist, by.x="Description", by.y="protein", all =FALSE)
However, the protein names follow different nomenclatures in the two data frames, so the merge function finds no matches.
So how can I perform a partial match for "TAF10" when it is only part of the string "Transcription initiation factor TFIID subunit 10 OS=Mus musculus GN=Taf10 PE=1 SV=1 - [TAF10_MOUSE]"? In other words, I want R to match on just a piece of the whole string.
I tried to use the grep function:
idx2 <- sapply("tableaucleanIPTAFX$Description", grep, "sublist$Description")
However, I get:
as.data.frame(idx2)
[1] tableaucleanIPTAFX.Description
<0 rows> (or 0-length row.names)
I guess the pattern is not being recognized correctly. I then visited the RegExr website to write a regular expression that would recognize my ID names. I found that the following recognizes [TRRAP_MOUSE] in
Transformation/transcription domain-associated protein OS=Mus musculus GN=Trrap PE=1 SV=2 - [TRRAP_MOUSE]
with
/(TRRAP_[MOUSE])\w+/g
How can I apply this to my ID list (the "Description" column in my example)?
Recommended answer
This might work for you, and it handles duplicates:
First, some dummy data:
df1 <- data.frame(name=c("George", "Abraham", "Barack"), stringsAsFactors = FALSE)
df2 <- data.frame(president=c("Thanks, Obama (Barack)", "Lincoln, Abraham, George", "George Washington"), stringsAsFactors = FALSE)
Find the code in the full description using grep:
idx2 <- sapply(df1$name, grep, df2$president)
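To see what this step returns, here is a minimal sketch that rebuilds the dummy data and inspects the matches. It departs from the answer in two ways, both of which are my assumptions rather than part of the original: `lapply` is used so the result stays a list even when every name matches the same number of rows, and `fixed = TRUE` treats each name as a literal string instead of a regular expression.

```r
df1 <- data.frame(name = c("George", "Abraham", "Barack"), stringsAsFactors = FALSE)
df2 <- data.frame(
  president = c("Thanks, Obama (Barack)", "Lincoln, Abraham, George", "George Washington"),
  stringsAsFactors = FALSE
)

# For each name, collect the row numbers of df2 whose text contains that name.
matches <- lapply(df1$name, function(p) grep(p, df2$president, fixed = TRUE))
names(matches) <- df1$name

# One list element per name; an element may hold zero, one, or several rows.
str(matches)
```

A name with no match yields `integer(0)`, which later drops out of the combined result, so the behavior is that of an inner join.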
A code can match several descriptions, so each index into the first data frame is repeated once per match to keep the two sets of indices aligned:
idx1 <- sapply(seq_along(idx2), function(i) rep(i, length(idx2[[i]])))
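The same repetition can probably be written more compactly with `rep()` and `lengths()`. The sketch below uses a hand-written, hypothetical match list in the same shape as `idx2`, so that it runs on its own.

```r
# Hypothetical match list shaped like idx2: one element per name,
# each holding the matching row numbers of the second data frame.
idx2 <- list(George = c(2L, 3L), Abraham = 2L, Barack = 1L)

# lengths() counts the matches per element; rep() repeats each position
# that many times, so idx1 ends up as long as unlist(idx2).
idx1 <- rep(seq_along(idx2), lengths(idx2))
idx1
```

Because this yields a plain vector, no `unlist()` is needed on `idx1` afterwards.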
"Merge" the datasets with cbind, aligned on the new indices:
> cbind(df1[unlist(idx1),,drop=F], df2[unlist(idx2),,drop=F])
name president
1 George Lincoln, Abraham, George
1.1 George George Washington
2 Abraham Lincoln, Abraham, George
3 Barack Thanks, Obama (Barack)
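For the protein tables in the question, an alternative is to extract the bracketed ID from the long description and then use an ordinary `merge`. This is only a sketch under assumptions: the two small data frames are hypothetical stand-ins for `tableaucleanIPTAFXwoNA` and `sublist` (the placement of Q8C176 in the first table is invented for illustration), and the pattern assumes the `[XXX_MOUSE]` tag always sits at the end of the description.

```r
# Hypothetical stand-ins for the two tables in the question.
calc <- data.frame(
  Accession = c("P01837", "Q8C176"),
  Description = c(
    "Ig kappa chain C region OS=Mus musculus PE=1 SV=1 - [IGKC_MOUSE]",
    "Transcription initiation factor TFIID subunit 2 OS=Mus musculus GN=Taf2 PE=2 SV=2 - [TAF2_MOUSE]"
  ),
  stringsAsFactors = FALSE
)
sublist <- data.frame(
  complex = c("TFIID", "TFIID"),
  Description = c("[TAF2_MOUSE]", "[TAF3_MOUSE]"),
  stringsAsFactors = FALSE
)

# Keep only the trailing "[XXX_MOUSE]" tag of each long description.
# sub() returns its input unchanged when the pattern does not match,
# so rows without a tag simply find no partner in the merge.
calc$id <- sub("^.*(\\[[A-Za-z0-9]+_MOUSE\\])\\s*$", "\\1", calc$Description)

# Exact-match merge on the extracted tag (inner join by default).
fusion <- merge(calc, sublist, by.x = "id", by.y = "Description")
fusion
```

Merging on an extracted key avoids running `grep` once per protein and cannot accidentally match "TAF1" inside "TAF10", which a plain substring search would.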