Pattern matching in a data frame context


Problem description

I have a data frame, the first 5 lines of which look as follows:

Sample    CCT6        GAT1                   IMD3          PDR3          RIM15
001       0000000000  111111111111111111111  010001000011  0N100111NNNN  01111111111NNNNNN
002       1111111111  111111111111111111000  000000000000  0N100111NNNN  00000000000000000
003       0NNNN00000  000000000000000000000  010001000011  000000000000  11111111111111111
004       000000NNN0  11100111111N111111111  010001000011  111111111111  01111111111000000
005       0111100000  111111111111111111111  111111111111  0N100111NNNN  00000000000000000

The full data set has 2000 samples. I am trying to write code that tells me, for every sample, whether the string of digits in each of the 5 columns is homogeneous (i.e. all 1s or all 0s). Ideally, I'd also like to be able to differentiate between 1 and 0 in the cases where the answer is TRUE. From my example, the expected results would be:

Sample    CCT6        GAT1         IMD3          PDR3          RIM15
001       TRUE (0)    TRUE (1)     FALSE         FALSE         FALSE
002       TRUE (1)    FALSE        TRUE (0)      FALSE         TRUE (0)
003       FALSE       TRUE (0)     FALSE         TRUE (0)      TRUE (1)
004       FALSE       FALSE        FALSE         TRUE (1)      FALSE
005       FALSE       TRUE (1)     TRUE (1)      FALSE         TRUE (0)

I'm not stuck on using logicals, and I could use characters as long as they can be used to differentiate between the different classes. Ideally I'd like to return the results in a similar data frame.

I'm having trouble with the most basic first step here, which is to have R tell me whether a string consists entirely of the same character. I've tried various expressions using grep and regexpr, but have been unable to get back a result that I can apply to the entire data frame using ddply or something similar. Here are some examples of what I've tried for this step:

a = as.character("111111111111")
b = as.character("000000000000")
c = as.character("000000011110")


> grep("1",a)
[1] 1

> grep("1",c)
[1] 1

> regexpr("1",a)
[1] 1
attr(,"match.length")
[1] 1
> regexpr("1",c)
[1] 8
attr(,"match.length")
[1] 1

I'd greatly appreciate any help to get me started with this problem or help me accomplish my larger goal.

Solution

Here's a complete solution. Probably overkill, but also kind of fun.

The key bit is the markTRUE function. It uses a backreference (\\1) to refer to the substring (either 0 or 1) that was previously matched by the first parenthesized subexpression.

The regular expression "^(0|1)(\\1)+$" says 'match any string that begins with either 0 or 1, and is then followed (until the end of the string) by 1 or more repetitions of the same character --- whatever it was'. Later in the same call to gsub(), I use the same backreference to substitute either "TRUE (0)" or "TRUE (1)", as appropriate.
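
As a quick check of that pattern in isolation, here is a minimal sketch that applies it to the three test strings from the question. The variable names `pattern`, `x`, `is_homogeneous` and `labelled` are only illustrative, and `sub()` is used instead of `gsub()` because an anchored pattern can match at most once per string.

```r
# Pattern from the answer: a 0 or a 1, followed by one or more repeats
# of that same character, up to the end of the string.
pattern <- "^(0|1)(\\1)+$"

# The three test strings from the question.
x <- c("111111111111", "000000000000", "000000011110")

# grepl() returns one logical per string: TRUE only when the whole
# string is made of a single repeated digit.
is_homogeneous <- grepl(pattern, x)
print(is_homogeneous)
# [1]  TRUE  TRUE FALSE

# sub() reuses the captured digit in the replacement; strings that do
# not match the pattern are returned unchanged.
labelled <- sub(pattern, "TRUE (\\1)", x)
print(labelled)
# [1] "TRUE (1)"     "TRUE (0)"     "000000011110"
```

Unlike `grep("1", x)`, which only reports which elements contain a 1 somewhere, the anchors `^` and `$` force the pattern to account for the whole string.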

First read in the data:

dat <- 
read.table(textConnection("
Sample     CCT6        GAT1                   IMD3           PDR3          RIM15
001       0000000000  111111111111111111111  010001000011  0N100111NNNN  01111111111NNNNNN
002       1111111111  111111111111111111000  000000000000  0N100111NNNN  00000000000000000
003       0NNNN00000  000000000000000000000  010001000011  000000000000  11111111111111111
004       000000NNN0  11100111111N111111111  010001000011  111111111111  01111111111000000
005       0111100000  111111111111111111111  111111111111  0N100111NNNN  00000000000000000"),
header = TRUE)

Then unleash the regular expressions:

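# Replace every all-0 or all-1 string with "TRUE (0)" or "TRUE (1)"
markTRUE <- function(X) {
    gsub(X, pattern = "^(0|1)(\\1)+$", 
         replacement = "TRUE (\\1)")
}

# Anything not marked TRUE by the previous step becomes "FALSE"
markFALSE <- function(X) {
    X[!grepl("TRUE", X)]  <- "FALSE"
    return(X)
}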

dat[-1] <- lapply(dat[-1], markTRUE)
dat[-1] <- lapply(dat[-1], markFALSE)

dat
#   Sample     CCT6     GAT1     IMD3     PDR3    RIM15
# 1      1 TRUE (0) TRUE (1)    FALSE    FALSE    FALSE
# 2      2 TRUE (1)    FALSE    FALSE    FALSE TRUE (0)
# 3      3    FALSE TRUE (0)    FALSE TRUE (0) TRUE (1)
# 4      4    FALSE    FALSE    FALSE TRUE (1)    FALSE
# 5      5    FALSE TRUE (1) TRUE (1)    FALSE TRUE (0)
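
One detail in this output differs from the result the question asks for: IMD3 for sample 002 is reported as FALSE, although its string is 000000000000, and the Sample IDs have lost their leading zeros. A likely cause is that read.table() converts columns containing only digits to numbers, so that value is compared as "0" rather than as the original 12-character string. The sketch below assumes that reading every column as text with colClasses = "character" is acceptable. The names `dat_chr` and `label_homogeneous` are hypothetical and combine the two marking steps into one function for brevity.

```r
# Same five rows, but every column is kept as text so that leading
# zeros and long runs of digits survive (colClasses is the assumption
# added here; the call above lets read.table() guess the types).
dat_chr <- read.table(text = "
Sample CCT6 GAT1 IMD3 PDR3 RIM15
001 0000000000 111111111111111111111 010001000011 0N100111NNNN 01111111111NNNNNN
002 1111111111 111111111111111111000 000000000000 0N100111NNNN 00000000000000000
003 0NNNN00000 000000000000000000000 010001000011 000000000000 11111111111111111
004 000000NNN0 11100111111N111111111 010001000011 111111111111 01111111111000000
005 0111100000 111111111111111111111 111111111111 0N100111NNNN 00000000000000000",
  header = TRUE, colClasses = "character")

# Hypothetical helper: label homogeneous strings "TRUE (0)" or
# "TRUE (1)" and everything else "FALSE", using the same pattern.
label_homogeneous <- function(x) {
  pattern <- "^(0|1)(\\1)+$"
  ifelse(grepl(pattern, x), sub(pattern, "TRUE (\\1)", x), "FALSE")
}

# Apply the helper to every column except Sample.
dat_chr[-1] <- lapply(dat_chr[-1], label_homogeneous)

print(dat_chr$IMD3)
# [1] "FALSE"    "TRUE (0)" "FALSE"    "FALSE"    "TRUE (1)"
```

With the columns read as text, the IMD3 value for sample 002 matches the expected TRUE (0), and the Sample column keeps its original 001 to 005 labels.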
