Removing HTML tags from a string in R


Question

I'm trying to read a web page's source into R and process it as strings: I want to extract the paragraphs and remove the HTML tags from the paragraph text. I'm running into the following problem:

I tried implementing a function to remove the html tags:

library(stringr)  # provides str_locate_all and str_replace_all

cleanFun=function(fullStr)
{
 #find location of tags and citations
 tagLoc=cbind(str_locate_all(fullStr,"<")[[1]][,2],str_locate_all(fullStr,">")[[1]][,1]);

 #create storage for tag strings
 tagStrings=list()

 #extract and store tag strings
 for(i in 1:dim(tagLoc)[1])
 {
   tagStrings[i]=substr(fullStr,tagLoc[i,1],tagLoc[i,2]);
 }

 #remove tag strings from paragraph
 newStr=fullStr
 for(i in 1:length(tagStrings))
 {
   newStr=str_replace_all(newStr,tagStrings[[i]][1],"")
 }
 return(newStr)
};

This works for some tags but not all of them. One example where it fails is the following string:

test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"

The goal is to obtain:

cleanFun(test)="junk junk junk junk"

However, this doesn't seem to work. I thought it might be something to do with string length or escape characters, but I couldn't find a solution involving those.

Recommended answer

This can be achieved simply through regular expressions and the grep family:

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

This will also work with multiple html tags in the same string!
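
As a quick check, here is a minimal sketch that applies the answer's cleanFun to the test string from the question and to a small character vector. The `paragraphs` vector is an illustrative input that does not appear in the original post, and the function is repeated only so the snippet runs on its own.

```r
# Same definition as in the answer, repeated so this snippet is self-contained.
cleanFun <- function(htmlString) {
  gsub("<.*?>", "", htmlString)
}

# The string from the question, including the parentheses inside the tag.
test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
cleanFun(test)
# [1] "junk junk junk junk"

# gsub is vectorised, so a character vector of paragraphs can be cleaned in one call.
# `paragraphs` is a made-up example input.
paragraphs <- c("<p>first <b>bold</b> item</p>", "no tags here")
cleanFun(paragraphs)
# [1] "first bold item" "no tags here"
```

Because gsub accepts a character vector, no explicit loop over paragraphs is needed.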

This finds every instance of the pattern <.*?> in htmlString and replaces it with the empty string "". The ? in .*? makes the match non-greedy, so if you have multiple tags (e.g., <a> junk </a>) it will match <a> and </a> separately instead of the whole string.
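
The difference between the greedy and non-greedy forms can be seen on the <a> junk </a> example from the explanation. This is a small sketch; the variable names are illustrative, and the last pattern, a negated character class, is an alternative that is not part of the original answer.

```r
x <- "<a> junk </a>"

# Greedy: .* runs from the first "<" to the last ">", so the whole string is removed.
greedy <- gsub("<.*>", "", x)
greedy
# [1] ""

# Non-greedy: .*? stops at the first ">", so each tag is matched on its own.
lazy <- gsub("<.*?>", "", x)
lazy
# [1] " junk "

# Alternative (not from the answer): [^>]* cannot cross a ">", which has the same effect here.
negated <- gsub("<[^>]*>", "", x)
negated
# [1] " junk "
```

Note that the spaces around the text are kept; wrapping the result in trimws() would remove them if that is wanted.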
