Following "next" link with relative paths using rvest


Problem description

I am using the rvest package to scrape information from the page http://www.radiolab.org/series/podcasts. After scraping the first page, I want to follow the "Next" link at the bottom, scrape that second page, move onto the third page, etc.

The following line gives an error:

html_session("http://www.radiolab.org/series/podcasts") %>% follow_link("Next")
## Navigating to 
##     
##       ./2/  
## Error in parseURI(u) : cannot parse URI 
##     
##       ./2/  

Inspecting the HTML shows there is some extra cruft around the "./2/" that rvest apparently doesn't like:

html("http://www.radiolab.org/series/podcasts") %>% html_node(".pagefooter-next a")
## <a href="&#10;    &#10;      ./2/  ">Next</a> 

.Last.value %>% html_attrs()
##                   href 
## "\n    \n      ./2/  "

Question 1: How can I get rvest::follow_link to treat this link correctly like my browser does? (I could manually grab the "Next" link and clean it up with regex, but prefer to take advantage of the automation provided with rvest.)


At the end of the follow_link code, it calls jump_to. So I tried the following:

html_session("http://www.radiolab.org/series/podcasts") %>% jump_to("./2/")
## <session> http://www.radiolab.org/series/2/
##   Status: 404
##   Type:   text/html; charset=utf-8
##   Size:   10744
## Warning message:
## In request_GET(x, url, ...) : client error: (404) Not Found

Digging into the code, it looks like the issue is with XML::getRelativeURL, which uses dirname to strip off the last part of the original path ("/podcasts"):

XML::getRelativeURL("./2/", "http://www.radiolab.org/series/podcasts/")
## [1] "http://www.radiolab.org/series/./2"

XML::getRelativeURL("../3/", "http://www.radiolab.org/series/podcasts/2/")
## [1] "http://www.radiolab.org/series/3"

Question 2: How can I get rvest::jump_to and XML::getRelativeURL to correctly handle relative paths?

Solution

Since this problem still seems to occur with RadioLab.com, your best solution is to create a custom function to handle this edge case. If you're only worried about this site - and this particular error - then you can write something like this:

library(rvest)

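follow_next <- function(session, text = "Next", ...) {
    # Find the first element whose own text contains `text`
    link <- html_node(session, xpath = sprintf("//*[text()[contains(.,'%s')]]", text))
    url <- html_attr(link, "href")
    # Strip the newlines and spaces that pad the href
    url <- trimws(url)
    # Drop a leading "./" so that jump_to() receives a plain relative path
    url <- gsub("^\\./", "", url)
    message("Navigating to ", url)
    jump_to(session, url, ...)
}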

That would allow you to write code like this:

html_session("http://www.radiolab.org/series/podcasts") %>%
    follow_next()

#> Navigating to 2/
#> <session> http://www.radiolab.org/series/podcasts/2/
#>   Status: 200
#>   Type:   text/html; charset=utf-8
#>   Size:   61261
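
To see what the two cleanup steps inside follow_next do, here is a small offline sketch applied to the raw href value reported in the question. It uses only base R; the variable names are illustrative, and sub() stands in for gsub() because the anchored pattern can match at most once.

```r
# Raw href as reported by html_attrs() in the question (padded with whitespace).
raw_href <- "\n    \n      ./2/  "

# Step 1: drop the leading and trailing whitespace (newlines and spaces).
trimmed <- trimws(raw_href)

# Step 2: drop a single leading "./" so that only the relative path remains.
cleaned <- sub("^\\./", "", trimmed)

print(trimmed)
# [1] "./2/"
print(cleaned)
# [1] "2/"
```

The cleaned value "2/" is what follow_next passes on to jump_to.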

Strictly speaking, this is not a bug in rvest: the href on the RadioLab page is malformed (it is padded with newlines and spaces), and failing to parse a malformed URL is not a bug. If you want to be as lenient as a browser, you need to work around it manually.
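
As an aside on how a relative href is resolved once it has been cleaned: under the standard rules, the result depends on whether the base URL ends in a slash. The sketch below uses xml2::url_absolute() and assumes the xml2 package is installed. This is not the XML::getRelativeURL code path discussed in the question, so treat it as an illustration of the general behavior rather than a description of what rvest did at the time.

```r
library(xml2)

# The href from the question after trimming its whitespace padding.
href <- trimws("\n    \n      ./2/  ")

# Base without a trailing slash: the last segment "podcasts" is replaced.
no_slash <- url_absolute(href, "http://www.radiolab.org/series/podcasts")

# Base with a trailing slash: "podcasts/" is kept as a directory.
with_slash <- url_absolute(href, "http://www.radiolab.org/series/podcasts/")

print(no_slash)
# [1] "http://www.radiolab.org/series/2/"
print(with_slash)
# [1] "http://www.radiolab.org/series/podcasts/2/"
```

The first result matches the URL that returned the 404 in the question; the second matches the page that follow_next reaches.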

Note that you could also use RSelenium to launch an actual browser (e.g. Chrome) and have that perform the URL parsing for you.
