R regex find last occurrence of delimiter


Question

I'm trying to get the ending of email addresses (i.e. .net, .com, .edu, etc.), but the portion after the @ can contain multiple periods.

library(stringi)

strings1 <- c(
    'test@aol.com',
    'test@hotmail.com',
    'test@xyz.rr.edu',
    'test@abc.xx.zz.net'
)

list1 <- stri_split_fixed(strings1, "@", 2)
df1 <- data.frame(do.call(rbind,list1))

> list2 <- stri_split_fixed(df1$X2, '.(?!.*.)', 2); list2
[[1]]
[1] "aol.com"

[[2]]
[1] "hotmail.com"

[[3]]
[1] "xyz.rr.edu"

[[4]]
[1] "abc.xx.zz.net"

Any suggestions for getting something like this:

    X1            X2  X3
1 test       aol.com com
2 test   hotmail.com com
3 test    xyz.rr.edu edu
4 test abc.xx.zz.net net

Another attempt:

> list2 <- stri_split_fixed(df1$X2, '\.(?!.*\.)\w+', 2);list2
Error: '\.' is an unrecognized escape in character string starting "'\."

Recommended answer

Here are a few approaches. The first seems particularly straightforward and the second particularly short.

1) sub This can be done by applying sub in R once per column:

data.frame(X1 = sub("@.*", "", strings1), 
           X2 = sub(".*@", "", strings1), 
           X3 = sub(".*[.]", "", strings1), 
           stringsAsFactors = FALSE)

giving:

    X1            X2  X3
1 test       aol.com com
2 test   hotmail.com com
3 test    xyz.rr.edu edu
4 test abc.xx.zz.net net
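The X3 column relies on `.*` being greedy: it consumes as much as it can, so the `[.]` that follows ends up on the last dot rather than the first. The sketch below contrasts this with a lazy quantifier on a single address; the variable names `x`, `greedy` and `lazy` are only illustrative, and `perl = TRUE` is used so the lazy quantifier follows PCRE semantics.

```r
x <- "test@abc.xx.zz.net"

# Greedy: ".*" runs as far as it can, so "[.]" lands on the last dot
greedy <- sub(".*[.]", "", x)

# Lazy: ".*?" stops as early as it can, so "[.]" lands on the first dot
lazy <- sub(".*?[.]", "", x, perl = TRUE)

print(c(greedy = greedy, lazy = lazy))
```

Only the greedy form leaves the ending ("net"); the lazy form leaves everything after the first dot.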

2) strapplyc Here is an alternative using the gsubfn package that is particularly short. It returns a character matrix. strapplyc returns the matches to the parenthesized portions of the pattern. The first set of parentheses matches everything before the @, the second set matches everything after the @, and the last set, nested inside the second, matches everything after the last dot.

library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
t(strapplyc(strings1, pat, simplify = TRUE))

     [,1]   [,2]            [,3] 
[1,] "test" "aol.com"       "com"
[2,] "test" "hotmail.com"   "com"
[3,] "test" "xyz.rr.edu"    "edu"
[4,] "test" "abc.xx.zz.net" "net"
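If installing gsubfn is not an option, the same pattern can probably be reused with base R alone. The following is a sketch that swaps in regexec and regmatches in place of strapplyc (it is not part of the original answer); the names `m` and `res` are illustrative.

```r
strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")
pat <- "(.*)@(.*[.](.*))"

# regexec() records the full match plus each parenthesized group;
# regmatches() extracts them as one character vector per input string
m <- regmatches(strings1, regexec(pat, strings1))

# Stack the vectors into a matrix and drop column 1 (the full match)
res <- do.call(rbind, m)[, -1]
print(res)
```

The three remaining columns correspond to the three sets of parentheses in pat.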

2a) read.pattern read.pattern, also in the gsubfn package, can do it using the same pat defined in (2):

library(gsubfn)
pat <- "(.*)@(.*[.](.*))"
read.pattern(text = strings1, pat, as.is = TRUE)

giving a data.frame similar to (1), except that the column names are V1, V2 and V3.

3) strsplit The overlapping extractions make this difficult to do with a single strsplit, but we can do it with two applications of strsplit. The first strsplit splits at the @ and the second splits on everything up to and including the last dot. This second strsplit always produces an empty string as the first split string, and we delete it using [, -1]. This gives a character matrix:

 ss <- function(x, pat) do.call(rbind, strsplit(x, pat))
 cbind( ss(strings1, "@"), ss(strings1, ".*[.]")[, -1] )

giving the same answer as (2).
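To see where the empty string in (3) comes from, it may help to run the two splits on a single address. This is a small sketch; `at_split` and `dot_split` are illustrative names.

```r
# Splitting at "@" yields the local part and the domain
at_split <- strsplit("test@xyz.rr.edu", "@")[[1]]

# ".*[.]" matches from the start of the string through the last dot,
# so the piece in front of the match is empty and only the ending remains
dot_split <- strsplit("test@xyz.rr.edu", ".*[.]")[[1]]

print(at_split)
print(dot_split)
```

Because the match starts at the beginning of the string, the first element of the second split is always "", which is why that column is dropped.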

4) strsplit/sub This is a mix of (1) and (3):

cbind(do.call(rbind, strsplit(strings1, "@")), sub(".*[.]", "", strings1))

giving the same answer as (2).

4a) This is another way to use strsplit and sub. Here we use sub to append an @ followed by the TLD to each string, and then split on @.

do.call(rbind, strsplit(sub("(.*[.](.*))", "\\1@\\2", strings1), "@"))

giving the same answer as (2).
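For readers who want to stay with stringi as in the question, the two attempts there appear to fail for separate reasons: stri_split_fixed treats its pattern as literal text rather than a regex, and a backslash inside an R string literal has to be doubled. A minimal sketch under those assumptions, using stri_split_regex (the names `parts`, `pieces`, `tld` and `res` are illustrative):

```r
library(stringi)

strings1 <- c("test@aol.com", "test@hotmail.com",
              "test@xyz.rr.edu", "test@abc.xx.zz.net")

# Fixed matching is fine for the literal "@"; simplify = TRUE gives a matrix
parts <- stri_split_fixed(strings1, "@", 2, simplify = TRUE)

# Regex matching, with each backslash doubled inside the R string literal:
# split at a dot that is not followed by any later dot (the last dot)
pieces <- stri_split_regex(parts[, 2], "\\.(?![^.]*\\.)")

# The TLD is the last piece of each split
tld <- vapply(pieces, function(p) p[length(p)], character(1))

res <- data.frame(X1 = parts[, 1], X2 = parts[, 2], X3 = tld,
                  stringsAsFactors = FALSE)
print(res)
```

This keeps the two-step structure of the original attempt and changes only the matching mode and the escaping.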

Update: Added additional solutions.
