How to find/replace all URLs/links in text strings in R using regex


Problem description

I have a text file with n rows, each row being a character string.

I would like to import this into R and sequentially remove all URLs beginning (specifically) with http using regex.
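
A minimal sketch of the import step, assuming one string per line. The temporary file and its sample lines are made up for illustration; in practice `path` would point at the actual text file.

```r
# Create a temporary stand-in for the text file (hypothetical sample lines)
path <- tempfile(fileext = ".txt")
writeLines(c("first line http://example.com/a more text",
             "second line without a link",
             "third line http://example.com/b"),
           path)

# readLines returns a character vector with one element per line,
# with the line terminators stripped
x <- readLines(path)
length(x)  # number of rows (n)
```

Because `readLines` strips the line terminators, a URL at the end of a row is followed by nothing at all, so a pattern that requires a whitespace character after the URL cannot match it.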

The following worked within an interactive regex checker (re-builder in Emacs), but not within R:

gsub("http:.*?[([:space:])| |\n]", "", x)

Note

This question and my answer below stem from an earlier question about regex engines and their compatibility with one another.

Recommended answer

My solution is as follows:

output <- sapply(input, FUN = function(x) gsub("http\\S+\\s*", "", x))

  • sapply applies a function to each element of input and simplifies the result (the "s" stands for "simplify"). For a data frame the elements are its columns, and gsub then processes every row within each column.
  • gsub uses regex to find each link and removes it by substituting it with nothing: ""
  • the regex "http\\S+\\s*":

      1. "http" finds all occurrences of "http" within input
      2. "\S+" continues on from http through all non-whitespace characters
      3. "\s*" also consumes any whitespace (zero or more characters) directly after the link, so that it is removed together with the link

  • the trailing x is the argument of the anonymous function passed as FUN; sapply supplies each element of input as x.
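
The steps above can be tried on a small example. This is only a sketch: the `input` vector is made-up sample data standing in for the rows of the text file. Since `gsub` is vectorised over its input, the sketch also shows the call without `sapply` for comparison.

```r
# Hypothetical sample input standing in for the rows of the text file
input <- c(
  "see http://example.com/page for details",
  "no link here",
  "https://example.org/a?b=1 at the start",
  "link at the end http://example.com"
)

# The solution as given: apply gsub to each element via sapply
output <- sapply(input, FUN = function(x) gsub("http\\S+\\s*", "", x))

# gsub is vectorised, so calling it directly gives the same strings
# (sapply additionally attaches the original strings as names)
direct <- gsub("http\\S+\\s*", "", input)

print(direct)
```

Note that only whitespace after the link is consumed, so a link at the end of a row leaves the space that preceded it in place.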

I think the main takeaway (for me, at least) is the use of double backslashes within R. For example, using the following regex, I was able to remove all URLs within the Emacs interactive regex checker (Emacs command: M-x re-builder), but not in R:

      "http:.*?[([:space:])| |\n]"
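
A short sketch of why the backslashes are doubled. In an R string literal the backslash is itself an escape character, so writing "\S" with a single backslash is rejected by the parser as an unrecognised escape. The raw-string form at the end is an addition that assumes R 4.0.0 or later.

```r
# "\\S" in the source is the two characters \S once the string is parsed
pattern <- "http\\S+\\s*"
cat(pattern, "\n")   # shows the pattern as the regex engine receives it
nchar("\\S")         # 2: one backslash plus the letter S

# Raw strings (assumes R >= 4.0.0) avoid the doubling
raw_pattern <- r"(http\S+\s*)"
identical(pattern, raw_pattern)
```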
      

I wasn't sure myself how to go about this: there are many free online tools for testing regular expressions interactively against the target text, but R uses its own flavour of regex. It is possible to use a Perl (version 5.x) regex engine, but my answer above avoids this.
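
For comparison, a sketch of the Perl-compatible engine, which base R's `gsub` exposes through the `perl = TRUE` argument. The sample vector `x` is hypothetical; for this particular pattern both engines should give the same result.

```r
x <- c("text http://example.com/a more", "plain text")  # hypothetical sample

default_engine <- gsub("http\\S+\\s*", "", x)               # default (TRE) engine
perl_engine    <- gsub("http\\S+\\s*", "", x, perl = TRUE)  # PCRE engine

identical(default_engine, perl_engine)
```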

      The short discussion in this thread may prove useful in explaining this all.
