How to subset data with advanced string matching

Problem description

I have the following data frame from which I would like to extract rows based on matching strings.

> GEMA_EO5
gene_symbol  fold_EO  p_value                           RefSeq_ID      BH_p_value
       KNG1 3.433049 8.56e-28              NM_000893,NM_001102416    1.234245e-24
      REXO4 3.245317 1.78e-27                           NM_020385    2.281367e-24
      VPS29 3.827665 2.22e-25                 NM_057180,NM_016226    2.560770e-22
    CYP51A1 3.363149 5.95e-25              NM_000786,NM_001146152    6.239386e-22
      TNPO2 4.707600 1.60e-23 NM_001136195,NM_001136196,NM_013433    1.538000e-20
      NSDHL 2.703922 6.74e-23              NM_001129765,NM_015922    5.980454e-20
     DPYSL2 5.097382 1.29e-22                           NM_001386    1.062868e-19

For example, I would like to extract two rows based on matching strings in $RefSeq_ID. That works fine with the following:

> list<-c("NM_001386", "NM_020385")
> GEMA_EO6<-subset(GEMA_EO5, GEMA_EO5$RefSeq_ID %in% list, drop = TRUE)

> GEMA_EO6

gene_symbol  fold_EO  p_value RefSeq_ID    BH_p_value
      REXO4 3.245317 1.78e-27 NM_020385  2.281367e-24
     DPYSL2 5.097382 1.29e-22 NM_001386  1.062868e-19

But some of the rows have several RefSeq_IDs separated with commas, so I am looking for a general way of telling if $RefSeq_ID contains a certain string pattern and then subset that row.

Recommended answer

To do partial matching you'll need to use regular expressions (see ?grepl). Here's a solution to your particular problem:

##Notice that the first element appears in 
##a row containing commas
l = c( "NM_013433", "NM_001386", "NM_020385")

To test one sequence at a time, we just select a particular seq id:

R> subset(GEMA_EO5, grepl(l[1], GEMA_EO5$RefSeq_ID))
  gene_symbol fold_EO p_value                           RefSeq_ID BH_p_value
5       TNPO2   4.708 1.6e-23 NM_001136195,NM_001136196,NM_013433  1.538e-20
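
As a minimal sketch of what happens in that call: grepl() returns one TRUE/FALSE value per row, and subset() keeps the rows where the value is TRUE. The data frame `df` below is a hypothetical two-row stand-in for GEMA_EO5, and `fixed = TRUE` is an optional addition (not part of the answer above) that treats the ID as a literal string instead of a regular expression.

```r
# Hypothetical two-row stand-in for GEMA_EO5 (only the columns needed here)
df <- data.frame(
  gene_symbol = c("TNPO2", "DPYSL2"),
  RefSeq_ID   = c("NM_001136195,NM_001136196,NM_013433", "NM_001386"),
  stringsAsFactors = FALSE
)

# grepl() yields a logical vector with one element per row;
# fixed = TRUE matches the ID literally rather than as a regex
hit <- grepl("NM_013433", df$RefSeq_ID, fixed = TRUE)

# Indexing with that vector selects the same rows subset() would keep
df[hit, ]
```

Literal matching may be worth considering if the IDs ever carry characters that are special in regular expressions, such as the dot in a versioned accession.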

To test for multiple genes, we use the | operator:

R> paste(l, collapse="|")
[1] "NM_013433|NM_001386|NM_020385"
R> grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID)
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE

So

subset(GEMA_EO5, grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID))

should give you what you want.
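
One caveat with pattern matching is that grepl() also matches substrings, so a truncated or shorter ID can hit a longer one. If exact ID matches are needed, one possible alternative (a sketch, not the method from the answer above) is to split each comma-separated field and compare whole IDs with %in%. The names `df`, `wanted`, `id_lists`, and `keep` are hypothetical, and the data is a three-row excerpt of the table in the question.

```r
# Hypothetical three-row excerpt of GEMA_EO5
df <- data.frame(
  gene_symbol = c("REXO4", "TNPO2", "NSDHL"),
  RefSeq_ID   = c("NM_020385",
                  "NM_001136195,NM_001136196,NM_013433",
                  "NM_001129765,NM_015922"),
  stringsAsFactors = FALSE
)

wanted <- c("NM_013433", "NM_020385")

# Split each field on commas, giving one character vector of IDs per row
id_lists <- strsplit(df$RefSeq_ID, ",", fixed = TRUE)

# Keep a row if any of its IDs equals one of the wanted IDs exactly
keep <- vapply(id_lists,
               function(ids) any(trimws(ids) %in% wanted),
               logical(1))

exact <- df[keep, ]

# For comparison: a truncated ID still matches as a substring with grepl(),
# but not with the exact comparison above
partial_hit <- grepl("NM_01592", df$RefSeq_ID, fixed = TRUE)
exact_hit   <- vapply(id_lists, function(ids) "NM_01592" %in% ids, logical(1))
```

Here `exact` contains the REXO4 and TNPO2 rows, while `partial_hit` is TRUE for the NSDHL row and `exact_hit` is FALSE everywhere. Which behaviour is preferable depends on whether partial IDs are meant to match.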
