R Disparity between browser and GET / getURL


Problem description

I'm trying to download the content from a page and I'm finding that the response data is either malformed or incomplete, as if GET or getURL are pulling before those data are loaded.

library(httr)
library(RCurl)
url <- "https://www.vanguardcanada.ca/individual/etfs/etfs.htm"
d1 <- GET(url) # This shows a lot of {{ moustache style }} code that's not filled
d2 <- getURL(url) # This shows "" as if it didn't get anything

I'm not sure how to proceed. My goal is to get the numbers associated with the links that show in the browser:

https://www.vanguardcanada.ca/individual/etfs/etfs-detail-overview.htm?portId=9548

So in this case, I want to download and scrape '9548'.

Not sure why getURL and GET seem to get wildly different results than what's presented in the browser. It seems like data is loaded slowly and almost as if GET and getURL pull before it's fully loaded.

For example, see:

library(XML)
x <- "https://www.vanguardcanada.ca/individual/etfs/etfs-detail-prices.htm?portId=9548"
# htmlParse() expects text, so extract the response body explicitly
readHTMLTable(htmlParse(content(GET(x), as="text")))

Answer

It's important to understand that when you scrape a webpage, you are getting the raw HTML source code for that page; this isn't necessarily exactly what you will be interacting with in a web browser. When you call GET(url) you are getting the actual html/text that is the source of that page. This is what is being sent directly from the server. Nowadays most web pages also assume the browser will not only display the HTML, but will also execute the javascript code on that page. This is especially true when a lot of in-page content is generated later by javascript. That's exactly what's going on in this page. The "content" on the page isn't found in the html source of that page; it is downloaded later via javascript.

Neither httr nor RCurl will execute the javascript required to "fill" the page with the table you are actually viewing. There is a package called RSelenium which is capable of interacting with a browser to execute javascript, but in this case we actually can get around that.

First, just a side note on why getURL didn't work. It seems this web server sniffs the user-agent sent by the requesting program and sends different content back. Whatever default user agent RCurl uses isn't deemed "good" enough to get the html from the server. You can get around this by specifying a different user agent. For example:

d2 <- getURL(url, .opts=list(useragent="Mozilla/5.0"))

seems to work.
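For completeness, a minimal sketch of the same user-agent trick on the httr side (an assumption that the same spoofing is what's needed; `user_agent()` is httr's config helper for this):

```r
library(httr)

# Build a user-agent config object; GET() merges it into the request.
ua <- user_agent("Mozilla/5.0")

# Same request as before, but with the spoofed user agent attached.
d1 <- GET("https://www.vanguardcanada.ca/individual/etfs/etfs.htm", ua)
```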

But getting back to the main problem. When working on problems like this, I strongly recommend you use the Chrome developer tools (or whatever the equivalent is in your favorite browser). In the Chrome developer tools, specifically on the Network tab, you can see all the requests made by Chrome to get the data.

If you click on the first one ("etfs.htm") you can see the headers and response for that request. On the response sub-tab, you should see exactly the same content that is found by GET or getURL. Then a bunch of CSS and javascript files are downloaded. The file that looks most interesting is "GetETFJson.js". This actually seems to hold most of the data in an almost JSON-like format. It has some true javascript in front of the JSON block that gets in the way a bit, but we can download that file with

d3 <- GET("https://www.vanguardcanada.ca/individual/mvc/GetETFJson.js")

and extract the text with

p3 <- content(d3, as="text")

and then parse the JSON with

library(jsonlite)
r3 <- fromJSON(substr(p3,13,nchar(p3)))

Again, we are using substr above to strip off the non-JSON stuff at the beginning to make it easier to parse.
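The hard-coded offset 13 is fragile if the javascript prefix ever changes length. A small base-R sketch, using a made-up stand-in string for the real file contents, finds the first `{` instead:

```r
# Stand-in payload with the same "javascript prefix + JSON" shape as
# GetETFJson.js (the real prefix and data will differ).
p3 <- "var etfData = {\"fundData\":{\"portId\":\"9548\"}}"

# Locate the first '{' and take everything from there, instead of
# hard-coding how many characters to skip.
start <- regexpr("{", p3, fixed = TRUE)
json_text <- substring(p3, start)
```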

Now, you can explore the object returned. But it looks like the data you want is stored in the following vectors

cbind(r3$fundData$Fund$profile$portId, r3$fundData$Fund$profile$benchMark)

      [,1]   [,2]                                                                            
 [1,] "9548" "FTSE All World ex Canada Index in CAD"                                         
 [2,] "9561" "FTSE Canada All Cap Index in CAD"                                              
 [3,] "9554" "Spliced Canada Index"                                                          
 [4,] "9559" "FTSE Canada All Cap Real Estate Capped 25% Index"                              
 [5,] "9560" "FTSE Canada High Dividend Yield Index"                                         
 [6,] "9550" "FTSE Developed Asia Pacific Index in CAD"                                      
 [7,] "9549" "FTSE Developed Europe Index in CAD"                                            
 [8,] "9558" "FTSE Developed ex North America Index in CAD"                                  
 [9,] "9555" "Spliced FTSE Developed ex North America Index Hedged in CAD"                   
[10,] "9556" "Spliced Emerging Markets Index in CAD"                                         
[11,] "9563" "S&P 500 Index in CAD"                                                          
[12,] "9562" "S&P 500 Index in CAD Hedged"                                                   
[13,] "9566" "NASDAQ US Dividend Achievers Select Index in CAD"                              
[14,] "9564" "NASDAQ US Dividend Achievers Select Index Hedged in CAD"                       
[15,] "9557" "CRSP US Total Market Index in CAD"                                             
[16,] "9551" "Spliced US Total Market Index Hedged in CAD"                                   
[17,] "9552" "Barclays Global Aggregate CAD Float Adjusted Index in CAD"                     
[18,] "9553" "Barclays Global Aggregate CAD 1-5 Year Govt/Credit Float Adj Ix in CAD"        
[19,] "9565" "Barclays Global Aggregate Canadian 1-5 Year Credit Float Adjusted Index in CAD"
[20,] "9568" "Barclays Global Aggregate ex-USD Float Adjusted RIC Capped Index Hedged in CAD"
[21,] "9567" "Barclays U.S. Aggregate Float Adjusted Index Hedged in CAD"  
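Those two vectors can also be collected into a data frame for easier filtering. A sketch using a mocked-up object with the same shape as `r3$fundData$Fund$profile` (field names taken from the `cbind` call above; on a real run, use `r3` directly):

```r
# Mock with the same shape as r3$fundData$Fund$profile.
profile <- list(
  portId    = c("9548", "9561"),
  benchMark = c("FTSE All World ex Canada Index in CAD",
                "FTSE Canada All Cap Index in CAD")
)

# One row per fund, keeping the columns as character vectors.
funds <- data.frame(portId    = profile$portId,
                    benchmark = profile$benchMark,
                    stringsAsFactors = FALSE)
```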

So hopefully that should be sufficient to extract the data you need to identify the path to the URL with more data.
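As a final step, the overview URLs from the question can be rebuilt from the extracted ids with `paste0()` (using the first two ids shown above as an example):

```r
port_ids <- c("9548", "9561")  # first two ids from the matrix above

# Rebuild the per-fund overview URLs from the question.
urls <- paste0(
  "https://www.vanguardcanada.ca/individual/etfs/etfs-detail-overview.htm?portId=",
  port_ids
)
```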
