Difference between read_html(url) and read_html(content(GET(url), "text"))


Problem Description


I am looking at this great answer: https://stackoverflow.com/a/58211397/3502164.

The beginning of the solution includes:

library(httr)
library(xml2)

gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(content(gr, "text"))

xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

Output is constant across multiple requests:

"59243d3a2....61f8f73136118f9"

My default approach so far would have been:

doc <- read_html("https://nzffdms.niwa.co.nz/search")
xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

That result differs from the output above and changes across multiple requests.

Question:

What is the difference between:

  • read_html(url)
  • read_html(content(GET(url), "text"))

Why does it result in different values, and why does only the "GET" solution return the CSV in the linked question?

(I hope it is OK to structure this as three sub-questions.)

What I tried:

Going down the rabbit hole of function calls:

read_html                        # print the generic (it just calls UseMethod)
(ms <- methods("read_html"))     # list the S3 methods registered for read_html
getAnywhere(ms[1])               # inspect the first of those methods
xml2:::read_html
xml2:::read_html.default
#xml2:::read_html.response

read_xml                         # same exploration for read_xml
(ms <- methods("read_xml"))
getAnywhere(ms[1])

But that resulted in this question: Find the used method for R wrapper functions
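
Another way to see which method actually runs, sketched here under the assumption that the class of the argument is all that matters for S3 dispatch: check the class of what you pass to read_html().

library(httr)
library(xml2)

gr <- GET("https://nzffdms.niwa.co.nz/search")
class(gr)                   # "response"  -> dispatches to xml2:::read_html.response
class(content(gr, "text"))  # "character" -> dispatches to xml2:::read_html.default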

Thoughts:

  • I don't see that the GET request sends any headers or cookies that could explain the different responses.

  • From my understanding, both read_html(url) and read_html(content(GET(.), "text")) return XML/HTML.

  • OK, I am not sure whether it makes sense to check this, but because I ran out of ideas: I checked whether some kind of caching is going on.

Code:

with_verbose(GET("https://nzffdms.niwa.co.nz/search"))
....
<- Expires: Thu, 19 Nov 1981 08:52:00 GMT
<- Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0

--> So it does not look to me like caching explains the difference.

  • Looking at help("GET") gives an interesting section concerning a "conditional GET":

The semantics of the GET method change to a "conditional GET" if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field. A conditional GET method requests that the entity be transferred only under the circumstances described by the conditional header field(s). The conditional GET method is intended to reduce unnecessary network usage by allowing cached entities to be refreshed without requiring multiple requests or transferring data already held by the client.

But as far as I can see with with_verbose(), none of If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range is set.
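
A quick sketch of how to double-check this without reading the verbose log, assuming it is enough to inspect the request object that httr stores on the response:

library(httr)

gr <- GET("https://nzffdms.niwa.co.nz/search")
gr$request$headers   # headers actually sent; no If-Modified-Since / If-None-Match etc. by default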

Solution

The difference is that with repeated calls to httr::GET, the handle persists between calls. With xml2::read_html(), a new connection is made each time.

From the httr documentation:

The handle pool is used to automatically reuse Curl handles for the same scheme/host/port combination. This ensures that the http session is automatically reused, and cookies are maintained across requests to a site without user intervention.
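
A minimal sketch of the same idea, reusing an explicit handle via httr::handle() instead of relying on the pool: the session cookie received with the first request is sent back with the second.

library(httr)

h <- handle("https://nzffdms.niwa.co.nz")

r1 <- GET("https://nzffdms.niwa.co.nz/search", handle = h)
r2 <- GET("https://nzffdms.niwa.co.nz/search", handle = h)

# same handle, same curl cookie jar: both requests belong to one server session
cookies(r1)
cookies(r2)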

From the xml2 documentation, discussing the string parameter that is passed to read_html():

A string can be either a path, a url or literal xml. Urls will be converted into connections either using base::url or, if installed, curl::curl
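
And a small sketch of the other side of the comparison: each read_html(url) call opens a fresh connection, so the server sees a brand-new session and hands out a new CSRF token every time.

library(xml2)

d1 <- read_html("https://nzffdms.niwa.co.nz/search")
d2 <- read_html("https://nzffdms.niwa.co.nz/search")

# the two tokens should differ, unlike the repeated GET() calls shown below
xml_attr(xml_find_all(d1, ".//input[@name='search[_csrf_token]']"), "value")
xml_attr(xml_find_all(d2, ".//input[@name='search[_csrf_token]']"), "value")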

So your answer is: read_html(GET(url)) is like refreshing your browser, whereas read_html(url) is like closing your browser and opening a new one. The server embeds a unique session ID in the page it delivers: new session, new ID. You can demonstrate this by calling httr::handle_reset(url):

library(httr)
library(xml2)

# GET the page (note xml2 handles httr responses directly, don't need content("text"))
gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(gr)
print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

# A new GET using the same handle gets exactly the same response
gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(gr)
print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

# Now call GET again after resetting the handle
httr::handle_reset("https://nzffdms.niwa.co.nz/search")
gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(gr)
print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

In my case, sourcing the above code gives me:

[1] "ecd9be7c75559364a2a5568049c0313f"
[1] "ecd9be7c75559364a2a5568049c0313f"
[1] "d953ce7acc985adbf25eceb89841c713"
