How to merge two data frames based on a partial string match in R?


Question

I have two data frames.

The first one contains a large number of proteins for which I have made several calculations. Here is an example:

>Accession  Description # Peptides A2   # PSM A2    # Peptides B2   # PSM B2    # Peptides C2   # PSM C2    # Peptides D2   # PSM D2    # Peptides E2   # PSM E2    # AAs   MW [kDa]    calc. pI
P01837  Ig kappa chain C region OS=Mus musculus PE=1 SV=1 - [IGKC_MOUSE]    10  319 8   128 8   116 7   114         106 11,8    5,41
P01868  Ig gamma-1 chain C region secreted form OS=Mus musculus GN=Ighg1 PE=1 SV=1 - [IGHG1_MOUSE]  13  251 15  122 16  116 16  108         324 35,7    7,40
P60710  Actin, cytoplasmic 1 OS=Mus musculus GN=Actb PE=1 SV=1 - [ACTB_MOUSE]   15  215 10  37  11  30  11  31  16  154 375 41,7    5,48

The second contains the proteins of interest. Here is an example:

>complex    Description Accession   protein
TFIID   [TAF1_MOUSE]    Q80UV9-3    Isoform 3 of Transcription initiation factor TFIID subunit 1 OS=Mus musculus GN=Taf1 - [TAF1_MOUSE]
TFIID   [TAF2_MOUSE]    Q8C176  Transcription initiation factor TFIID subunit 2 OS=Mus musculus GN=Taf2 PE=2 SV=2 - [TAF2_MOUSE]
TFIID   [TAF3_MOUSE]    Q5HZG4  Transcription initiation factor TFIID subunit 3 OS=Mus musculus GN=Taf3 PE=1 SV=2 - [TAF3_MOUSE]

What I want to do: get one data frame containing the values from my calculations for the proteins of interest only. In a first attempt I used:

fusion <- merge.data.frame(x=tableaucleanIPTAFXwoNA, y=sublist, by.x="Description", by.y="protein", all =FALSE)

However, the protein names are written differently in the two data frames, so the merge function finds no matches.

So how can I perform a partial match for "TAF10" when it is part of the string "Transcription initiation factor TFIID subunit 10 OS=Mus musculus GN=Taf10 PE=1 SV=1 - [TAF10_MOUSE]"? In other words, I want R to match on only a piece of the whole string.

I tried to use the grep function:

idx2 <- sapply("tableaucleanIPTAFX$Description", grep, "sublist$Description")  

However, I got:

as.data.frame(idx2)
[1] tableaucleanIPTAFX.Description
<0 rows> (or 0-length row.names)

I guess the pattern is not recognized correctly. I then used the RegExr website to write a regular expression that recognizes my ID names. I found that [TRRAP_MOUSE] can be recognized in

Transformation/transcription domain-associated protein OS=Mus musculus GN=Trrap PE=1 SV=2 - [TRRAP_MOUSE] :

using

 /(TRRAP_[MOUSE])\w+/g

I wonder how I can apply this to my ID list (the "Description" column in my example)?

Recommended answer

This might work for you, and it handles duplicates:

First, some dummy data:

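# Short codes to look up
df1 <- data.frame(name = c("George", "Abraham", "Barack"), stringsAsFactors = FALSE)
# Full descriptions that may contain one or more of the codes
df2 <- data.frame(president = c("Thanks, Obama (Barack)", "Lincoln, Abraham, George", "George Washington"), stringsAsFactors = FALSE)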

Find the code in the full description using grep:

idx2 <- sapply(df1$name, grep, df2$president)
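Two caveats apply to this step. `sapply` simplifies its result to a vector or matrix when every name matches the same number of rows, and `grep` treats each name as a regular expression, so a bracketed ID such as `[TAF1_MOUSE]` would be read as a character class rather than literal text. A slightly more defensive sketch of the same lookup, assuming the codes should be matched literally, swaps in `lapply` and `fixed = TRUE` (neither is in the line above):

```r
# Same dummy data as above, repeated so this block runs on its own
df1 <- data.frame(name = c("George", "Abraham", "Barack"), stringsAsFactors = FALSE)
df2 <- data.frame(president = c("Thanks, Obama (Barack)", "Lincoln, Abraham, George", "George Washington"), stringsAsFactors = FALSE)

# lapply always returns a list with one element per name, whatever the match counts.
# fixed = TRUE makes grep search for the literal text instead of a regular expression.
idx2 <- lapply(df1$name, grep, x = df2$president, fixed = TRUE)
names(idx2) <- df1$name
```

Each element of `idx2` holds the row numbers of `df2` whose description contains that name, so the steps that follow work unchanged.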

This can produce multiple matches if several descriptions contain the same code, so here I repeat the original indices so that the results align:

idx1 <- sapply(seq_along(idx2), function(i) rep(i, length(idx2[[i]])))

"Merge" the datasets with cbind, aligned on the new indices:

> cbind(df1[unlist(idx1),,drop=F], df2[unlist(idx2),,drop=F])
       name                president
1    George Lincoln, Abraham, George
1.1  George        George Washington
2   Abraham Lincoln, Abraham, George
3    Barack   Thanks, Obama (Barack)
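Applied to the data in the question, the same idea could be sketched as follows, assuming the bracketed ID (for example `[TAF2_MOUSE]`) is what identifies a protein in both tables. The data frames `calc` and `sublist` are tiny stand-ins with made-up `PSM` values, and extracting the ID with `regmatches` before an ordinary `merge` is a swapped-in alternative to the index approach above, not part of the original answer:

```r
# Hypothetical stand-ins for the two tables in the question (PSM values are made up)
calc <- data.frame(
  Description = c(
    "Ig kappa chain C region OS=Mus musculus PE=1 SV=1 - [IGKC_MOUSE]",
    "Transcription initiation factor TFIID subunit 2 OS=Mus musculus GN=Taf2 PE=2 SV=2 - [TAF2_MOUSE]",
    "Transcription initiation factor TFIID subunit 10 OS=Mus musculus GN=Taf10 PE=1 SV=1 - [TAF10_MOUSE]"
  ),
  PSM = c(319, 12, 7),
  stringsAsFactors = FALSE
)
sublist <- data.frame(
  complex = "TFIID",
  Description = c("[TAF1_MOUSE]", "[TAF2_MOUSE]", "[TAF10_MOUSE]"),
  stringsAsFactors = FALSE
)

# Pull the bracketed ID out of each long description; rows without one stay NA
m <- regexpr("\\[[A-Za-z0-9]+_MOUSE\\]", calc$Description)
calc$id <- NA_character_
calc$id[m > 0] <- regmatches(calc$Description, m)

# Exact-match merge on the extracted ID keeps only the proteins of interest
fusion <- merge(calc, sublist, by.x = "id", by.y = "Description")
```

Because the merge compares whole IDs, `[TAF1_MOUSE]` cannot accidentally match `[TAF10_MOUSE]`, which a plain substring search for "TAF1" would do.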
