How to find/replace all URLs/links in text strings in R using regex
Question
I have a text file with n rows, each row being a character string.
I would like to import this into R and sequentially remove all URLs beginning (specifically) with http, using regex.
The following worked within an interactive regex checker (re-builder in Emacs), but not within R:
gsub("http:.*?[([:space:])| |\n]", "", x)
Note
This question and my answer below stem from another question about regex engines and their compatibility with one another.
Recommended answer
My solution is as follows:
output <- sapply(input, FUN = function(x) gsub("http\\S+\\s*", "", x))
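To see the one-liner above in action, here is a minimal sketch. The contents of input are made up for illustration (the original post does not show its data), and the second call assumes input is a plain character vector, in which case gsub can be called on it directly.

```r
# Hypothetical sample input; the original post does not show its data.
input <- c(
  "see http://example.com/page for details",
  "no link on this line",
  "two links: https://a.example/x and http://b.example/y done"
)

# The approach from the answer: apply gsub to each element via sapply.
output <- sapply(input, FUN = function(x) gsub("http\\S+\\s*", "", x))

# gsub is vectorised over character vectors, so calling it directly
# yields the same strings, without the names that sapply attaches.
output_vec <- gsub("http\\S+\\s*", "", input)

print(unname(output))
print(output_vec)
```

Both calls should leave the second line untouched and strip the links (plus the whitespace after them) from the other two.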
- sapply applies a function to each element of its input (each row of my data, in my case) and simplifies the result.
- gsub uses regex to find each link and removes it by substituting it with nothing: ""
- the regex "http\\S+\\s*":
  - "http" finds all occurrences of "http" within input
  - "\S+" continues on from http through all non-whitespace characters
  - "\s*" then matches any whitespace that follows (zero or more characters), so the space after the link is removed as well
- the trailing x is just the input that the function definition FUN points to within the sapply function.
I think the main takeaway (for me, at least) is the use of double backslashes within R. For example, using the following regex, I was able to remove all URLs within the Emacs interactive regex checker (Emacs command: M-x re-builder), but not in R:
"http:.*?[([:space:])| |\n]"
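The double-backslash point can be checked directly. In an ordinary R string literal the backslash is itself an escape character, so typing "\S" is rejected as an unrecognized escape, and the regex \S+ has to be written as "\\S+". The sketch below assumes R 4.0.0 or later for the raw-string form; the variable names are only illustrative.

```r
# The regex \S+\s* has to be typed with doubled backslashes
# inside a normal R string literal.
pattern <- "http\\S+\\s*"

# cat() shows the string as the regex engine receives it,
# i.e. with single backslashes.
cat(pattern, "\n")

# Assumes R >= 4.0.0: raw strings keep backslashes literal,
# so the pattern can be typed as it appears in a regex tester.
pattern_raw <- r"(http\S+\s*)"

# Both spellings produce the same ten-character string.
print(identical(pattern, pattern_raw))
print(nchar(pattern))
```

This is why a pattern copied from an interactive tester often needs its backslashes doubled before it behaves the same way in R.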
I wasn't sure how to do this myself: many free online tools let you test regular expressions interactively against the target text, but R uses its own flavour of regex. It is possible to use a Perl (version 5.x) regex engine, but my answer avoids this.
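For reference, switching engines in base R is done with the perl argument of gsub. The sketch below is only an illustration on a made-up string; for this particular pattern the default engine and the Perl-compatible one should give the same result, which is why the answer does not need perl = TRUE.

```r
# Hypothetical input string, not taken from the original post.
x <- "docs at http://example.com/a and https://example.org/b end"

# Default engine (TRE, POSIX-style extended regular expressions).
default_result <- gsub("http\\S+\\s*", "", x)

# Same pattern, evaluated by the Perl-compatible engine (PCRE).
perl_result <- gsub("http\\S+\\s*", "", x, perl = TRUE)

print(default_result)
print(identical(default_result, perl_result))
```

Patterns written for an online tester that follows Perl syntax are more likely to carry over unchanged when perl = TRUE is set, though the backslashes still need doubling in a normal R string.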
The short discussion in this thread may prove useful in explaining all of this.