Extract URLs with regex into a new data frame column

Problem description

I want to use a regex to extract all URLs from the text in a data frame into a new column. I have some older code that I used to extract keywords, so I'm looking to adapt it for a regex. I want to save the regex as a string variable and apply it here:

data$ContentURL <- apply(sapply(regex, grepl, data$Content, fixed=FALSE), 1, function(x) paste(selection[x], collapse=','))

It seems that fixed=FALSE should tell grepl that the pattern is a regular expression, but R doesn't like how I am trying to save the regex:

regex <- "http.*?1-\\d+,\\d+"

My data is organized in a data frame like this:

data <- read.table(text='"Content"     "date"   
 1     "a house a home https://www.foo.com"     "12/31/2013"
 2     "cabin ideas https://www.example.com in the woods"     "5/4/2013"
 3     "motel is a hotel"   "1/4/2013"', header=TRUE)

I would like the result to look like this:

                                           Content       date              ContentURL
1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
3                                 motel is a hotel   1/4/2013                        

Recommended answer

A Hadleyverse solution (the stringr package) with a decent URL pattern:

library(stringr)

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

data$ContentURL <- str_extract(data$Content, url_pattern)

data

##                                            Content       date              ContentURL
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>
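
For comparison, the grepl call in the question cannot produce this column on its own, because grepl only reports TRUE or FALSE for each row and never returns the matched text. A minimal base R sketch of the same extraction, without stringr, might look like the following. The sample data is rebuilt with data.frame for self-containment, and the simplified pattern "https?://[^[:space:]]+" is an assumption made for illustration, not the pattern from the answer above.

```r
# Sample data mirroring the question (rebuilt here so the sketch runs on its own)
data <- data.frame(
  Content = c("a house a home https://www.foo.com",
              "cabin ideas https://www.example.com in the woods",
              "motel is a hotel"),
  date = c("12/31/2013", "5/4/2013", "1/4/2013"),
  stringsAsFactors = FALSE
)

# Assumed, simplified pattern: "http" or "https", then everything up to the next whitespace
url_pattern <- "https?://[^[:space:]]+"

# regexpr() gives the position of the first match in each row (-1 when there is none)
m <- regexpr(url_pattern, data$Content)

# regmatches() returns only the rows that matched, so fill those rows and leave the rest NA
data$ContentURL <- NA_character_
data$ContentURL[m != -1] <- regmatches(data$Content, m)

data
```

Like str_extract, this keeps only the first URL in each row.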

If a Content value contains more than one URL, you can use str_extract_all instead, but it returns a list with one character vector per row, so you will need some extra processing afterwards to get a single column.
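
One possible form of that extra processing, sketched under the assumption that a comma-separated string per row is the desired result (as in the paste(..., collapse=',') call from the question). The sample data below is hypothetical and only exists to show a row with two URLs.

```r
library(stringr)

# Hypothetical sample data: the first row contains two URLs
data <- data.frame(
  Content = c("see https://www.foo.com and https://www.example.com",
              "cabin ideas https://www.example.com in the woods",
              "motel is a hotel"),
  stringsAsFactors = FALSE
)

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# str_extract_all() returns a list with one character vector of matches per row
matches <- str_extract_all(data$Content, url_pattern)

# Collapse each vector into one string; rows without a match become "" rather than NA
data$ContentURL <- vapply(matches, paste, character(1), collapse = ",")

data
```

Rows without a URL end up as an empty string here, which matches the blank cell in the desired output from the question rather than the NA that str_extract produces.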
