rvest package read_html() function stops reading at "<" symbol

Problem description

I was wondering if this behavior is intentional in the rvest package. When rvest sees the < character it stops reading the HTML.

library(rvest)
read_html("<html><title>under 30 years = < 30 years <title></html>")

This prints:

[1] <head>\n  <title>under 30 = </title>\n</head>

If this is intentional, is there a workaround?

Recommended answer

Yes, this is normal for rvest because it is normal for HTML.

See the w3schools HTML Entities page. < and > are reserved characters in HTML, so a literal < or > has to be written as a character entity (&lt; or &gt;) instead. Here is the entity table from that page, listing some commonly used characters and their respective HTML entities.

XML::readHTMLTable("http://www.w3schools.com/html/html_entities.asp", which = 2)
#    Result          Description Entity Name Entity Number
# 1           non-breaking space      &nbsp;        &#160;
# 2       <            less than        &lt;         &#60;
# 3       >         greater than        &gt;         &#62;
# 4       &            ampersand       &amp;         &#38;
# 5       ¢                 cent      &cent;        &#162;
# 6       £                pound     &pound;        &#163;
# 7       ¥                  yen       &yen;        &#165;
# 8       €                 euro      &euro;       &#8364;
# 9       ©            copyright      &copy;        &#169;
# 10      ® registered trademark       &reg;        &#174;

So you will have to replace those values, perhaps with gsub() or manually if there aren't too many. You can see that it will parse properly when those characters are replaced with the correct entity.

library(XML)
doc <- htmlParse("<html><title>under 30 years = &lt; 30 years </title></html>")
xmlValue(doc["//title"][[1]])
# [1] "under 30 years = < 30 years "
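
If the text is still separate from the markup when the page is assembled, one option is to escape it before pasting it into the HTML string. The sketch below uses base R only; escape_html_text() is a hypothetical helper name, not a function from rvest or XML, and it covers only the three reserved characters &, < and >.

```r
# Hypothetical helper (not part of rvest or XML): escape the reserved
# characters in plain text before embedding it in an HTML string.
escape_html_text <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)  # ampersand first, so the entities added below are not re-escaped
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

title <- "under 30 years = < 30 years"
html <- paste0("<html><title>", escape_html_text(title), "</title></html>")
html
# [1] "<html><title>under 30 years = &lt; 30 years</title></html>"
```

The ampersand is replaced first on purpose: doing it last would turn the freshly written &lt; into &amp;lt;.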

You can use gsub() as shown below:

txt <- "<html><title>under 30 years = < 30 years </title></html>"
xmlValue(htmlParse(gsub(" < ", " &lt; ", txt, fixed = TRUE))["//title"][[1]])
# [1] "under 30 years = < 30 years "
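
The fixed pattern " < " only catches a less-than sign with a space on each side. A slightly more general sketch, assuming that every real tag starts with < followed directly by a letter, /, ! or ?, is to escape any other < with a negative lookahead. escape_stray_lt() is a hypothetical name, and the heuristic will miss a stray < that is immediately followed by a letter (for example "a <b"), so treat it as a starting point rather than a complete fix.

```r
# Hypothetical helper: escape "<" characters that do not look like the
# start of a tag, closing tag, comment/doctype, or processing instruction.
escape_stray_lt <- function(x) {
  gsub("<(?![A-Za-z/!?])", "&lt;", x, perl = TRUE)
}

txt <- "<html><title>under 30 years = < 30 years </title></html>"
repaired <- escape_stray_lt(txt)
repaired
# [1] "<html><title>under 30 years = &lt; 30 years </title></html>"
```

The repaired string can then be passed to htmlParse() or read_html() in the same way as the gsub() result above.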

I used the XML package here, but the same applies to other packages that process HTML.
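
For completeness, here is a sketch of the same replacement with rvest, the package from the question. It assumes rvest 1.0.0 or later for html_element() (older versions offer html_node() instead), and the parsing step is guarded so the snippet still runs when rvest is not installed. The extracted title is expected to match the text returned by the XML example.

```r
txt <- "<html><title>under 30 years = < 30 years </title></html>"

# Same replacement as in the XML example: write the literal "<" as an entity.
fixed <- gsub(" < ", " &lt; ", txt, fixed = TRUE)

# Assumption: rvest >= 1.0.0 is installed; otherwise this step is skipped.
if (requireNamespace("rvest", quietly = TRUE)) {
  page <- rvest::read_html(fixed)
  title_text <- rvest::html_text(rvest::html_element(page, "title"))
  print(title_text)
}
```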
