Parsing HTML to text with link-tags remaining in R

Question

I am trying to parse an HTML file (downloaded via the Google Drive API as text/html) into a list in R.

The HTML looks like this (sorry for the German content):

<p style='padding:0;margin:0;color:#000000;font-size:11pt;font-
family:"Arial";line-height:1.15;orphans:2;widows:2;text-align:left'>
<span>text: Das </span>
<span style="color:#1155cc;text-decoration:underline"><a 
href="https://www.google.com/url?q=http://www.bundesverfassungsgericht.de/SharedDocs/Entscheidungen/DE/2011/10/rs20111012_2bvr023608.html&amp;sa=D&amp;ust=1503574789125000&amp;usg=AFQ
jCNE4Ij3mvMX-QttYQYqspAaMxaZaeg" style="color:inherit;text-
decoration:inherit">Verfassungsgericht urteilt</a></span>
<span style='color:#000000;font-weight:400;text-
decoration:none;vertical-align:baseline;font-size:11pt;font-
family:"Arial";font-style:normal'>, 
dass eindeutig private Kommunikation von der Überwachung ausgenommen 
sein muss</span></p>

It works well when I only extract the text with xmlValue (from the XML library), using something like:

library(XML)
# html holds the downloaded HTML as a character string
doc <- htmlParse(html, asText = TRUE)
text <- xpathSApply(doc, "//text()", xmlValue)

But in my case, I need to retain the links (<a>-tags) in the HTML file and delete the https://www.google.com/url?q= part. So I want to get rid of all styling and keep only the text plus the link-tags.

I tried to get both kinds of nodes by using //(p | a) in the XPath, but it didn't work.
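A note on the XPath attempt: a parenthesised union directly after // is not valid in XPath 1.0, which is the version libxml2 (and therefore the XML package) implements. A minimal sketch of two XPath 1.0 ways to select both node types follows; the short html string is a made-up stand-in for the real document, not the original file.

```r
library(XML)

# Made-up miniature of the document in the question
html <- paste0(
  '<p><span>text: Das </span>',
  '<span><a href="http://example.org/">Verfassungsgericht urteilt</a></span></p>'
)
doc <- htmlParse(html, asText = TRUE)

# XPath 1.0 union: each side must be a complete path expression
nodes <- getNodeSet(doc, "//p | //a")

# Equivalent form using the self axis inside a predicate
nodes_alt <- getNodeSet(doc, "//*[self::p or self::a]")

node_names <- sapply(nodes, xmlName)
```

Either expression selects the nodes, but the styling attributes and the Google redirect prefix are still present on them.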

Recommended answer

I prefer to use the rvest package instead of XML.

In this code I use the rvest package to parse the HTML and extract the links from the page. Then, using the stringr package, I split each link at the ?q= part and keep the second half of the original link.

library(rvest)
library(stringr)

# Read the HTML file
page <- read_html("sample.txt")

# Find the link nodes, then extract their href attributes (i.e. the links)
link <- page %>% html_nodes("a") %>% html_attr("href")
# Take the second piece of the first link after splitting at "?q="
# (use sapply over the list if the document contains more than one link)
desiredlink <- str_split(link, "\\?q=")[[1]][2]

# Extract the text of all span nodes
span_text <- page %>% html_nodes("span") %>% html_text()
# or the text under the p nodes
p_text <- page %>% html_nodes("p") %>% html_text()

I saved your sample HTML code from above to the file "sample.txt".
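The steps above return the link and the text separately. One possible way to put them back together, so that each paragraph becomes plain text with only the <a> tags kept, is sketched below with the xml2 package (the parser that rvest builds on). This is a sketch under assumptions, not part of the original answer: clean_href and text_with_links are hypothetical helper names, the html string is a shortened stand-in for the sample, and the code assumes the target URL contains no unescaped "&" of its own. It does not decode percent-encoded characters or re-escape special characters in the link text.

```r
library(xml2)

# Shortened stand-in for the Google Docs export shown in the question
html <- paste0(
  '<p style="padding:0;margin:0"><span>text: Das </span>',
  '<span style="color:#1155cc"><a href="https://www.google.com/url?q=',
  'http://www.bundesverfassungsgericht.de/SharedDocs/Entscheidungen/DE/',
  '2011/10/rs20111012_2bvr023608.html',
  '&amp;sa=D&amp;ust=1503574789125000" style="color:inherit">',
  'Verfassungsgericht urteilt</a></span>',
  '<span style="color:#000000">, dass private Kommunikation ',
  'ausgenommen sein muss</span></p>'
)
page <- read_html(html)

# Hypothetical helper: strip the Google redirect prefix and everything
# from the first "&" onward (the sa=, ust=, usg= tracking parameters)
clean_href <- function(href) {
  target <- sub("^https?://www\\.google\\.com/url\\?q=", "", href)
  sub("&.*$", "", target)
}

# Hypothetical helper: rebuild one <p> as text, keeping only <a> tags
text_with_links <- function(p) {
  texts <- xml_find_all(p, ".//text()")
  # href of the enclosing <a> for each text node, NA where there is none
  hrefs <- xml_attr(xml_find_first(texts, "ancestor::a"), "href")
  txt <- xml_text(texts)
  pieces <- ifelse(
    is.na(hrefs),
    txt,
    sprintf('<a href="%s">%s</a>', clean_href(hrefs), txt)
  )
  paste(pieces, collapse = "")
}

paragraphs <- xml_find_all(page, "//p")
result <- vapply(paragraphs, text_with_links, character(1))
```

Here result holds one string per <p> node. Because only the text nodes and the href values are copied over, every style attribute and <span> wrapper is dropped.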
