Find third occurrence of a special character and drop everything before that in R
Question
I have this sample vector containing URLs. My goal is to obtain the path of the URL.
sample1 <- c("http://tercihblog.com/indirisu/docugard/", "http://funerariagomez.com/js/ggogle/a201209e3f79b740337b7bdb521630fe/",
"http://www.t-online.de/contacts/2015/08/atlas.html/", "http://mgracetimber.ie/wp-content/themes/Banner/db/box/",
"http://zamartrade.com/cs/DHL/DHL%20_%20Tracking.htm/", "http://dunhamengineering.com/menu/Auto-loadgoogleDrive/Document.Index/",
"http://www.indiegogo.com/guide/forum/2014/09/forgot-password/",
"http://raetc.com/wp-admin/Service/clients/votre-compte/en-ligne/imp-rem.fr/",
"http://www.lidanhang.com/img/?https://secure.runescape.com/m=weblogin/loginform.ws?mod=www&hwjklxlamp;ssl=0&dest/",
"http://www.sudaener.com/wp-includes/js/crop/dropbox/", "https://zeustracker.abuse.ch/blocklist.php/",
"https://zeustracker.abuse.ch/blocklist.php?download=hostsdeny/",
"https://zeustracker.abuse.ch/blocklist.php?download=iptablesblocklist/",
"https://zeustracker.abuse.ch/blocklist.php?download=snort/",
"https://zeustracker.abuse.ch/blocklist.php?download=squiddomain/"
)
My initial attempt was this:
gsub('http://[^/]+/','/',sample1)
However, this won't work with URLs that start with https://. A suitable solution would be to drop everything before the third occurrence of "/". I was wondering how to do this with a regex, and also whether there is a way to do it using substring.
Thanks.
Recommended answer
It is really advisable to go with gsub here, since the code is cleaner and more straightforward.
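For comparison, the substring route the question asks about could look roughly like the sketch below. The helper name path_from_third_slash is hypothetical and not part of the original answer; it assumes base R only and finds the third "/" with gregexpr before cutting the string with substring.

```r
# Hypothetical helper (not from the original answer): keep everything
# from the third "/" onward, using positions instead of a replacement regex.
path_from_third_slash <- function(urls) {
  # gregexpr() returns, for each string, the positions of every literal "/"
  slash_pos <- gregexpr("/", urls, fixed = TRUE)
  # Take the third position; NA when a string has fewer than three slashes
  third <- vapply(
    slash_pos,
    function(p) if (length(p) >= 3) p[3] else NA_integer_,
    integer(1)
  )
  # substring() is vectorised over the strings and the start positions
  substring(urls, third, nchar(urls))
}

urls <- c("http://tercihblog.com/indirisu/docugard/",
          "https://zeustracker.abuse.ch/blocklist.php/")
path_from_third_slash(urls)
# "/indirisu/docugard/" "/blocklist.php/"
```

This works for both http and https, but it takes noticeably more code than a single gsub call, which is the point made above.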
If you want to remove everything before the 3rd /, use
> gsub('^(?:[^/]*/){3}','/',sample1)
[1] "/indirisu/docugard/"
[2] "/js/ggogle/a201209e3f79b740337b7bdb521630fe/"
[3] "/contacts/2015/08/atlas.html/"
[4] "/wp-content/themes/Banner/db/box/"
[5] "/cs/DHL/DHL%20_%20Tracking.htm/"
[6] "/menu/Auto-loadgoogleDrive/Document.Index/"
[7] "/guide/forum/2014/09/forgot-password/"
[8] "/wp-admin/Service/clients/votre-compte/en-ligne/imp-rem.fr/"
[9] "/img/?https://secure.runescape.com/m=weblogin/loginform.ws?mod=www&hwjklxlamp;ssl=0&dest/"
[10] "/wp-includes/js/crop/dropbox/"
[11] "/blocklist.php/"
[12] "/blocklist.php?download=hostsdeny/"
[13] "/blocklist.php?download=iptablesblocklist/"
[14] "/blocklist.php?download=snort/"
[15] "/blocklist.php?download=squiddomain/"
^(?:[^/]*/){3} matches:
^ - start of string
(?:[^/]*/){3} - exactly 3 occurrences of:
    [^/]* - zero or more characters other than /
    / - a literal / character.
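To see what the three repetitions actually consume, the following sketch extracts the matched prefix for one of the sample URLs. The variable names pattern and url are illustrative only; regmatches and regexpr are base R functions.

```r
pattern <- "^(?:[^/]*/){3}"
url <- "https://zeustracker.abuse.ch/blocklist.php?download=snort/"

# The prefix consumed by the pattern: "https:/", then the empty piece
# between the two slashes plus "/", then the host plus "/"
regmatches(url, regexpr(pattern, url))
# "https://zeustracker.abuse.ch/"

# Replacing that prefix with "/" leaves the path. sub() is enough here
# because the ^ anchor allows at most one match per string.
sub(pattern, "/", url)
# "/blocklist.php?download=snort/"
```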
Cath suggests a more precise fix for your regex, but perhaps you'd like to add ^ at the start so that it only matches at the beginning of the string:
gsub('^https?://[^/]+/','/',sample1)
The ? (greedy) quantifier means one or zero occurrences, thus making the s after http optional. It is equivalent to (but more efficient than)
gsub('^(https|http)://[^/]+/','/',sample1)
You may also want to make your regex case-insensitive by adding ignore.case = TRUE.
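A small sketch of both points, the optional s and case-insensitive matching. The third URL with an upper-case scheme is a made-up test input, not part of the original sample vector.

```r
urls <- c("http://tercihblog.com/indirisu/docugard/",
          "https://zeustracker.abuse.ch/blocklist.php/",
          "HTTPS://zeustracker.abuse.ch/blocklist.php/")  # made-up upper-case variant

# Case-sensitive: http and https both match thanks to "s?", but the
# upper-case scheme does not, so the third URL comes back unchanged
case_sensitive <- gsub("^https?://[^/]+/", "/", urls)
case_sensitive
# "/indirisu/docugard/" "/blocklist.php/" "HTTPS://zeustracker.abuse.ch/blocklist.php/"

# Case-insensitive: all three are reduced to their paths
case_insensitive <- gsub("^https?://[^/]+/", "/", urls, ignore.case = TRUE)
case_insensitive
# "/indirisu/docugard/" "/blocklist.php/" "/blocklist.php/"
```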