R regex find last occurrence of delimiter
Question
I'm trying to get the ending of email addresses (i.e. .net, .com, .edu, etc.), but the portion after the @ can contain multiple periods.
library(stringi)
strings1 <- c(
'test@aol.com',
'test@hotmail.com',
'test@xyz.rr.edu',
'test@abc.xx.zz.net'
)
list1 <- stri_split_fixed(strings1, "@", 2)
df1 <- data.frame(do.call(rbind,list1))
> list2 <- stri_split_fixed(df1$X2, '.(?!.*.)', 2);list2
[[1]]
[1] "aol.com"
[[2]]
[1] "hotmail.com"
[[3]]
[1] "xyz.rr.edu"
[[4]]
[1] "abc.xx.zz.net"
Any suggestions for getting something like this:
X1 X2 X3
1 test aol.com com
2 test hotmail.com com
3 test xyz.rr.edu edu
4 test abc.xx.zz.net net
Another attempt:
> list2 <- stri_split_fixed(df1$X2, '\.(?!.*\.)\w+', 2);list2
Error: '\.' is an unrecognized escape in character string starting "'\."
Recommended answer
Here are a few approaches. The first seems particularly straightforward and the second particularly short.
1) sub: This can be done by applying sub in R once for each column:
data.frame(X1 = sub("@.*", "", strings1),
X2 = sub(".*@", "", strings1),
X3 = sub(".*[.]", "", strings1),
stringsAsFactors = FALSE)
giving:
X1 X2 X3
1 test aol.com com
2 test hotmail.com com
3 test xyz.rr.edu edu
4 test abc.xx.zz.net net
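The X3 column works because `.*` is greedy, so the `[.]` that follows it can only match the last dot. The sketch below restates that with base R only. It also shows the escaped-dot spelling (the backslash has to be doubled inside an R string literal, which is what the "unrecognized escape" error in the question is about) and an extraction-based alternative. The variable names `tld_class`, `tld_escaped` and `tld_extract` are illustrative and not from the original answer.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# ".*" is greedy: it consumes as much as it can, so the "[.]" after it
# ends up matching the LAST dot. Everything up to that dot is removed.
tld_class <- sub(".*[.]", "", strings1)

# Same pattern with an escaped dot. In an R string literal the
# backslash itself must be doubled: "\\." rather than "\.".
tld_escaped <- sub(".*\\.", "", strings1)

# Alternative (an assumption, not part of the answer): extract the
# trailing run of non-dot characters instead of deleting a prefix.
tld_extract <- regmatches(strings1, regexpr("[^.]+$", strings1))

print(tld_class)
```

All three vectors hold the same top-level domains; the character-class form `[.]` simply avoids the backslash question altogether.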
2) strapplyc: Here is an alternative using the gsubfn package that is particularly short. It returns a character matrix. strapplyc returns the matches to the parenthesized portions of the pattern. The first set of parentheses matches everything before the @, the second set matches everything after the @, and the last set, nested inside the second, matches everything after the last dot.
library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
t(strapplyc(strings1, pat, simplify = TRUE))
[,1] [,2] [,3]
[1,] "test" "aol.com" "com"
[2,] "test" "hotmail.com" "com"
[3,] "test" "xyz.rr.edu" "edu"
[4,] "test" "abc.xx.zz.net" "net"
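If gsubfn is not available, the same capture groups can probably be pulled out with base R's `regexec` and `regmatches`. This is a sketch of an equivalent, not what `strapplyc` does internally; `m` and `parts` are illustrative names.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")
pat <- "(.*)@(.*[.](.*))"

# regexec() records the full match plus each parenthesized group;
# regmatches() turns that into one character vector per input string.
m <- regmatches(strings1, regexec(pat, strings1))

# Stack the vectors into a matrix and drop column 1 (the full match),
# leaving the three captured groups.
parts <- do.call(rbind, m)[, -1]

print(parts)
```

The result is a 4 x 3 character matrix with the local part, the domain and the TLD in its columns.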
2a) read.pattern: read.pattern, also in the gsubfn package, can do it using the same pat defined in (2):
library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
read.pattern(text = strings1, pat, as.is = TRUE)
giving a data.frame similar to (1), except that the column names are V1, V2 and V3.
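For a data.frame result without gsubfn, `utils::strcapture` (available in R 3.4.0 and later) is one possible base R counterpart. The sketch below assumes that function is available; the `proto` template and the column names X1 to X3 are chosen here to mirror (1) and are not part of the original answer.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")
pat <- "(.*)@(.*[.](.*))"

# proto is a zero-row template: one column per capture group, giving
# the name and type each captured piece should get.
proto <- data.frame(X1 = character(), X2 = character(), X3 = character(),
                    stringsAsFactors = FALSE)

# strcapture() matches pat against every string and fills the template.
df <- strcapture(pat, strings1, proto)

print(df)
```

Unlike read.pattern, this lets the column names be set up front through the template.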
3) strsplit: The overlapping extractions make this difficult to do with strsplit, but it can be done with two applications of it. The first strsplit splits at the @, and the second splits on everything up to the last dot. This second strsplit always produces an empty string as its first piece, which we drop using [, -1]. This gives a character matrix:
ss <- function(x, pat) do.call(rbind, strsplit(x, pat))
cbind( ss(strings1, "@"), ss(strings1, ".*[.]")[, -1] )
giving the same answer as (2).
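To see why the [, -1] is needed, it may help to look at the two intermediate splits separately. This is a small illustrative sketch; `at_parts` and `dot_parts` are names introduced here.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# First split: at the "@", giving the local part and the domain.
at_parts <- strsplit(strings1, "@")

# Second split: the separator is everything up to and including the
# last dot. The separator sits at the very start of each string, so
# strsplit() reports an empty string before it.
dot_parts <- strsplit(strings1, ".*[.]")

print(at_parts[[3]])   # local part and domain
print(dot_parts[[3]])  # empty string, then the TLD
```

Each element of `dot_parts` has length two, and dropping its first entry leaves only the TLD.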
4) strsplit/sub: This is a mix of (1) and (3):
cbind(do.call(rbind, strsplit(strings1, "@")), sub(".*[.]", "", strings1))
giving the same answer as (2).
4a) This is another way to use strsplit and sub. Here we use sub to append an @ followed by the TLD to each string and then split on @.
do.call(rbind, strsplit(sub("(.*[.](.*))", "\\1@\\2", strings1), "@"))
giving the same answer as (2).
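The intermediate string makes the trick in (4a) easier to follow. The sketch below assumes each address contains exactly one @, as in the sample data; `marked` and `pieces` are illustrative names.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# \\1 is the whole string and \\2 is the text after the last dot,
# so the replacement rewrites each address as "<address>@<TLD>".
marked <- sub("(.*[.](.*))", "\\1@\\2", strings1)

# Splitting on "@" now yields three pieces per address.
pieces <- strsplit(marked, "@")

print(marked[4])
print(pieces[[4]])
```

An address with more than one @ would split into more than three pieces, and the rbind step in (4a) would then no longer line up into three columns.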
Update: Added additional solutions.