rvest package read_html() function stops reading at "<" symbol
Question
I was wondering whether this behavior is intentional in the rvest package. When rvest sees the < character, it stops reading the HTML.
library(rvest)
read_html("<html><title>under 30 years = < 30 years </title></html>")
Prints:
[1] <head>\n <title>under 30 = </title>\n</head>
If this is intentional, is there a workaround?
Recommended answer
Yes, this is normal for rvest because it is normal for HTML.
See the w3schools HTML Entities page. < and > are reserved characters in HTML, so their literal values have to be written another way, as character entities. Here is the entity table from the linked page, giving some commonly used HTML characters and their respective entities.
XML::readHTMLTable("http://www.w3schools.com/html/html_entities.asp", which = 2)
#    Result  Description           Entity Name  Entity Number
# 1          non-breaking space    &nbsp;       &#160;
# 2  <       less than             &lt;         &#60;
# 3  >       greater than          &gt;         &#62;
# 4  &       ampersand             &amp;        &#38;
# 5  ¢       cent                  &cent;       &#162;
# 6  £       pound                 &pound;      &#163;
# 7  ¥       yen                   &yen;        &#165;
# 8  €       euro                  &euro;       &#8364;
# 9  ©       copyright             &copy;       &#169;
# 10 ®       registered trademark  &reg;        &#174;
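If you are building the HTML string yourself, one option is to escape the text before it goes into the markup. The following is a minimal base R sketch; escape_html_text() is a hypothetical helper name, not a function from rvest or XML. The ampersand is replaced first so that the entities produced by the later steps are not escaped a second time.

```r
# Hypothetical helper: replace the three reserved characters in plain text
# with their named entities. "&" must be handled first.
escape_html_text <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# Escape only the text content, then wrap it in the tags.
title_text <- "under 30 years = < 30 years"
html <- paste0("<html><title>", escape_html_text(title_text), "</title></html>")
html
# [1] "<html><title>under 30 years = &lt; 30 years</title></html>"
```

This only works on text that is not yet mixed with tags; applying it to a full document would escape the tags as well.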
So you will have to replace those characters, perhaps with gsub(), or manually if there aren't too many. You can see that the document parses properly once they are replaced with the correct entity.
library(XML)
doc <- htmlParse("<html><title>under 30 years = &lt; 30 years </title></html>")
xmlValue(doc["//title"][[1]])
# [1] "under 30 years = < 30 years "
You could use gsub() like this:
txt <- "<html><title>under 30 years = < 30 years </title></html>"
xmlValue(htmlParse(gsub(" < ", " &lt; ", txt, fixed = TRUE))["//title"][[1]])
# [1] "under 30 years = < 30 years "
I used the XML package here, but the same applies to other packages that process HTML.
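For markup that already exists, the same idea can be sketched with a slightly more general pattern and then handed to rvest. This is a heuristic, not a full HTML repair: escape_bare_lt() is a hypothetical helper that assumes any < followed by whitespace or a digit is literal text rather than the start of a tag, so it will miss cases such as a<b. The rvest call assumes rvest 1.0.0 or later is installed (for html_element()) and is skipped otherwise.

```r
# Hypothetical helper: escape "<" only when the next character is whitespace
# or a digit, which cannot begin a tag name. Uses a Perl lookahead so the
# following character is left in place.
escape_bare_lt <- function(html) {
  gsub("<(?=[\\s\\d])", "&lt;", html, perl = TRUE)
}

txt <- "<html><title>under 30 years = < 30 years </title></html>"
fixed <- escape_bare_lt(txt)
fixed
# [1] "<html><title>under 30 years = &lt; 30 years </title></html>"

# Parse the repaired string with rvest if it is available.
if (requireNamespace("rvest", quietly = TRUE)) {
  page <- rvest::read_html(fixed)
  print(rvest::html_text(rvest::html_element(page, "title")))
}
```

The real tags in the string (<html>, <title>, </title>) are untouched because they are followed by a letter or a slash.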