Find third occurrence of a special character and drop everything before that in R

Views: 33 Published: 2021/9/1 18:45:12 Tags: regex r substring

Problem description

I have this sample vector containing URLs. My goal is to obtain the path of each URL.

sample1 <- c("http://tercihblog.com/indirisu/docugard/", "http://funerariagomez.com/js/ggogle/a201209e3f79b740337b7bdb521630fe/", 
      "http://www.t-online.de/contacts/2015/08/atlas.html/", "http://mgracetimber.ie/wp-content/themes/Banner/db/box/", 
      "http://zamartrade.com/cs/DHL/DHL%20_%20Tracking.htm/", "http://dunhamengineering.com/menu/Auto-loadgoogleDrive/Document.Index/", 
      "http://www.indiegogo.com/guide/forum/2014/09/forgot-password/", 
      "http://raetc.com/wp-admin/Service/clients/votre-compte/en-ligne/imp-rem.fr/", 
      "http://www.lidanhang.com/img/?https://secure.runescape.com/m=weblogin/loginform.ws?mod=www&amp;hwjklxlamp;ssl=0&amp;dest/", 
      "http://www.sudaener.com/wp-includes/js/crop/dropbox/", "https://zeustracker.abuse.ch/blocklist.php/", 
      "https://zeustracker.abuse.ch/blocklist.php?download=hostsdeny/", 
      "https://zeustracker.abuse.ch/blocklist.php?download=iptablesblocklist/", 
      "https://zeustracker.abuse.ch/blocklist.php?download=snort/", 
      "https://zeustracker.abuse.ch/blocklist.php?download=squiddomain/"
    )

My initial attempt was this:

gsub('http://[^/]+/','/',sample1)

However, this won't work with URLs that start with https://. A suitable solution would be to drop everything before the third occurrence of "/". I was wondering how to do this with a regex, and also whether there is a way to do it using substring.
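To make the failure concrete, here is a small check of that first attempt on two of the URLs from the vector above. The names `urls` and `first_try` are only for illustration.

```r
# Two URLs taken from sample1: one http://, one https://
urls <- c("http://tercihblog.com/indirisu/docugard/",
          "https://zeustracker.abuse.ch/blocklist.php/")

# The pattern requires the literal text "http://", so "https://" never matches
first_try <- gsub('http://[^/]+/', '/', urls)
first_try
# The first element is reduced to its path; the second comes back unchanged.
```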

Thanks

Recommended answer

It is really advisable to go with gsub here, since the code is cleaner and more straightforward.

If you want to remove everything before the 3rd /, use

> gsub('^(?:[^/]*/){3}','/',sample1)
 [1] "/indirisu/docugard/"                                                                              
 [2] "/js/ggogle/a201209e3f79b740337b7bdb521630fe/"                                                     
 [3] "/contacts/2015/08/atlas.html/"                                                                    
 [4] "/wp-content/themes/Banner/db/box/"                                                                
 [5] "/cs/DHL/DHL%20_%20Tracking.htm/"                                                                  
 [6] "/menu/Auto-loadgoogleDrive/Document.Index/"                                                       
 [7] "/guide/forum/2014/09/forgot-password/"                                                            
 [8] "/wp-admin/Service/clients/votre-compte/en-ligne/imp-rem.fr/"                                      
 [9] "/img/?https://secure.runescape.com/m=weblogin/loginform.ws?mod=www&amp;hwjklxlamp;ssl=0&amp;dest/"
[10] "/wp-includes/js/crop/dropbox/"                                                                    
[11] "/blocklist.php/"                                                                                  
[12] "/blocklist.php?download=hostsdeny/"                                                               
[13] "/blocklist.php?download=iptablesblocklist/"                                                       
[14] "/blocklist.php?download=snort/"                                                                   
[15] "/blocklist.php?download=squiddomain/"

The pattern ^(?:[^/]*/){3} matches:

- ^ - start of string
- (?:[^/]*/){3} - exactly 3 occurrences of:
  - [^/]* - zero or more characters other than /
  - / - a literal / character.
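The question also asked whether substring could do the job. The following is a minimal sketch rather than part of the original answer: it locates every "/" with gregexpr and cuts at the third one with substring. The helper name `drop_before_third_slash` is hypothetical, and returning NA for strings with fewer than three slashes is an assumption about the desired behaviour.

```r
# Hypothetical helper: keep everything from the third "/" onward.
drop_before_third_slash <- function(urls) {
  # gregexpr() gives, for each string, the positions of every literal "/"
  slash_pos <- gregexpr("/", urls, fixed = TRUE)
  # Position of the third "/", or NA when a string has fewer than three
  third <- vapply(
    slash_pos,
    function(p) if (length(p) >= 3) p[3] else NA_integer_,
    integer(1)
  )
  # substring() is vectorised over the start position; an NA start yields NA
  substring(urls, third)
}

urls <- c("http://tercihblog.com/indirisu/docugard/",
          "https://zeustracker.abuse.ch/blocklist.php/")
paths <- drop_before_third_slash(urls)
paths
```

This works, but it takes several steps to do what the single gsub call does, which is the sense in which the regex version is cleaner.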
Cath suggests a more precise fix for your regex, but perhaps you'd also like to add ^ at the start so that it only matches at the beginning of the string:
```
gsub('^https?://[^/]+/','/',sample1)
      ^     ^
```
The ? (greedy) quantifier means one or zero occurrences, thus making the s after http optional. It is equivalent to (but more efficient than) gsub('^(https|http)://[^/]+/','/',sample1).

You may also want to make your regex case-insensitive by adding ignore.case = TRUE.
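A short check of both points, as a sketch on a three-element vector. The upper-case HTTPS:// entry is a made-up variant added only to show the effect of ignore.case, and the variable names are illustrative.

```r
urls <- c("http://tercihblog.com/indirisu/docugard/",
          "https://zeustracker.abuse.ch/blocklist.php/",
          "HTTPS://zeustracker.abuse.ch/blocklist.php/")  # made-up upper-case variant

# Optional "s" versus explicit alternation: both give the same result
with_optional_s  <- gsub('^https?://[^/]+/', '/', urls)
with_alternation <- gsub('^(https|http)://[^/]+/', '/', urls)
same_result <- identical(with_optional_s, with_alternation)

# Case-sensitive matching leaves the upper-case scheme untouched;
# ignore.case = TRUE strips it as well
case_insensitive <- gsub('^https?://[^/]+/', '/', urls, ignore.case = TRUE)
case_insensitive
```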
  