Pattern matching in a data frame context
Problem description
I have a data frame, the first 5 lines of which look as follows:

    Sample CCT6       GAT1                  IMD3         PDR3         RIM15
    001    0000000000 111111111111111111111 010001000011 0N100111NNNN 01111111111NNNNNN
    002    1111111111 111111111111111111000 000000000000 0N100111NNNN 00000000000000000
    003    0NNNN00000 000000000000000000000 010001000011 000000000000 11111111111111111
    004    000000NNN0 11100111111N111111111 010001000011 111111111111 01111111111000000
    005    0111100000 111111111111111111111 111111111111 0N100111NNNN 00000000000000000
The full data set has 2000 samples. I am trying to write code that tells me whether the string of digits in each of the 5 columns is homogeneous (i.e. all 1s or all 0s) for every sample. Ideally, I'd also like to be able to differentiate between 1 and 0 in the cases where the answer is True. From my example, the expected results would be:

    Sample CCT6     GAT1     IMD3     PDR3     RIM15
    001    TRUE (0) TRUE (1) FALSE    FALSE    FALSE
    002    TRUE (1) FALSE    TRUE (0) FALSE    TRUE (0)
    003    FALSE    TRUE (0) FALSE    TRUE (0) TRUE (1)
    004    FALSE    FALSE    FALSE    TRUE (1) FALSE
    005    FALSE    TRUE (1) TRUE (1) FALSE    TRUE (0)
I'm not set on using logicals; characters would be fine as long as they can be used to differentiate between the different classes. Ideally I'd like to return the results in a similar data frame.
I'm having trouble with the most basic first step here, which is to have R tell whether a string consists of a single repeated value. I've tried various expressions using grep and regexpr, but have been unable to get a result back that I can apply to the entire data frame using ddply or something similar. Here are some examples of what I've tried for this step:

    a = as.character("111111111111")
    b = as.character("000000000000")
    c = as.character("000000011110")
    > grep("1",a)
    [1] 1
    > grep("1",c)
    [1] 1
    > regexpr("1",a)
    [1] 1
    attr(,"match.length")
    [1] 1
    > regexpr("1",c)
    [1] 8
    attr(,"match.length")
    [1] 1
I'd greatly appreciate any help getting started with this problem, or help accomplishing my larger goal.
Solution
Here's a complete solution. Probably overkill, but also kind of fun.
The key bit is the markTRUE function. It uses a backreference (\\1) to refer to the substring (either 0 or 1) that was previously matched by the first parenthesized subexpression.

The regular expression "^(0|1)(\\1)+$" says 'match any string that begins with either 0 or 1, and is then followed (until the end of the string) by 1 or more repetitions of the same character, whatever it was'. Later in the same call to gsub(), I use the same backreference to substitute either "TRUE (0)" or "TRUE (1)", as appropriate.

First read in the data:
    dat <- read.table(textConnection("
    Sample CCT6 GAT1 IMD3 PDR3 RIM15
    001 0000000000 111111111111111111111 010001000011 0N100111NNNN 01111111111NNNNNN
    002 1111111111 111111111111111111000 000000000000 0N100111NNNN 00000000000000000
    003 0NNNN00000 000000000000000000000 010001000011 000000000000 11111111111111111
    004 000000NNN0 11100111111N111111111 010001000011 111111111111 01111111111000000
    005 0111100000 111111111111111111111 111111111111 0N100111NNNN 00000000000000000"),
        header = TRUE,
        # Read every column as character so leading zeros survive. Without this,
        # all-digit columns such as IMD3 are read as numbers ("000000000000"
        # becomes 0) and no longer match the pattern.
        colClasses = "character")

Then unleash the regular expressions:

    # Replace homogeneous strings with "TRUE (0)" or "TRUE (1)"
    markTRUE <- function(X) {
        gsub(X, pattern = "^(0|1)(\\1)+$",
             replacement = "TRUE (\\1)")
    }

    # Everything that was not marked TRUE becomes "FALSE"
    markFALSE <- function(X) {
        X[!grepl("TRUE", X)] <- "FALSE"
        return(X)
    }

    # Apply both steps to every column except Sample
    dat[-1] <- lapply(dat[-1], markTRUE)
    dat[-1] <- lapply(dat[-1], markFALSE)
    dat
    #   Sample     CCT6     GAT1     IMD3     PDR3    RIM15
    # 1    001 TRUE (0) TRUE (1)    FALSE    FALSE    FALSE
    # 2    002 TRUE (1)    FALSE TRUE (0)    FALSE TRUE (0)
    # 3    003    FALSE TRUE (0)    FALSE TRUE (0) TRUE (1)
    # 4    004    FALSE    FALSE    FALSE TRUE (1)    FALSE
    # 5    005    FALSE TRUE (1) TRUE (1)    FALSE TRUE (0)
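For the "most basic first step" from the question, the same pattern can be tried on single strings before it is applied to the whole data frame. The following is only a minimal sketch: the names x, pattern, homogeneous and labelled are chosen here for illustration and are not part of the original answer.

```r
# The three test strings from the question (a, b and c)
x <- c("111111111111", "000000000000", "000000011110")

# Same pattern as in markTRUE: a 0 or a 1, followed until the end of the
# string by one or more repetitions of that same character
pattern <- "^(0|1)(\\1)+$"

# grepl() returns one logical value per string
homogeneous <- grepl(pattern, x)
print(homogeneous)
# [1]  TRUE  TRUE FALSE

# sub() rewrites only the strings that match; the rest pass through unchanged
labelled <- sub(pattern, "TRUE (\\1)", x)
print(labelled)
# [1] "TRUE (1)"     "TRUE (0)"     "000000011110"
```

Note that the + quantifier requires at least one repetition, so a single-character string such as "0" does not match. If such strings should count as homogeneous, (\\1)* could be used in place of (\\1)+.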