R: Disparity between browser and GET / getURL

Question

I'm trying to download the content from a page and I'm finding that the response data is either malformed or incomplete, as if GET or getURL are pulling before those data are loaded.

library(httr)
library(RCurl)
url <- "https://www.vanguardcanada.ca/individual/etfs/etfs.htm"
d1 <- GET(url) # This shows a lot of {{ moustache style }} code that's not filled
d2 <- getURL(url) # This shows "" as if it didn't get anything

I'm not sure how to proceed. My goal is to get the numbers associated with the links that show in the browser:

https://www.vanguardcanada.ca/individual/etfs/etfs-detail-overview.htm?portId=9548

So in this case, I want to download and scrape '9548'.

Not sure why getURL and GET seem to get wildly different results than what's presented in the browser. It seems like data is loaded slowly and almost as if GET and getURL pull before it's fully loaded.

For example, look at:

x <- "https://www.vanguardcanada.ca/individual/etfs/etfs-detail-prices.htm?portId=9548"
readHTMLTable(htmlParse(GET(x)))

Answer

I think the problem is that you might not understand how this web page works. When you call GET(url) you get the actual HTML/text that is the source of that page. This is what the server sends directly, and it isn't always exactly what the browser displays. That is especially true nowadays, when a lot of in-page content is generated later by JavaScript, and that's exactly what's going on in this page. The "content" on the page isn't found in the HTML source of that page; it is downloaded later via JavaScript.

Neither httr nor RCurl will execute the JavaScript required to "fill" the page with the table you are actually viewing. There is a package called RSelenium which can drive a browser to execute JavaScript, but in this case we can actually get around that.
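One quick way to confirm that a response is an unrendered template is to look for leftover {{ ... }} placeholders in the raw text. The sketch below uses a made-up HTML snippet (sample_html is an assumption for illustration, not the actual page source) and base R only:

```r
# Hypothetical snippet standing in for the raw page source returned by GET();
# the real page's markup will differ.
sample_html <- '<td>{{fund.name}}</td><td>{{ fund.price }}</td><p>Static text</p>'

# Find every {{ ... }} placeholder that a browser would normally fill in via JavaScript.
matches <- regmatches(sample_html, gregexpr("\\{\\{[^}]*\\}\\}", sample_html))[[1]]

# Drop the surrounding braces and any padding to get the bare placeholder names.
placeholders <- trimws(gsub("^\\{\\{|\\}\\}$", "", matches))

print(placeholders)
has_unrendered_template <- length(placeholders) > 0
```

If has_unrendered_template is TRUE for a real response, the data is most likely being loaded by a separate request, which is what the rest of this answer tracks down.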

First, a side note on why getURL didn't work. It seems this web server sniffs the user-agent sent by the requesting program and sends back different content depending on it. Whatever user-agent RCurl sends by default isn't deemed "good" enough to get the HTML from the server. You can get around this by specifying a different user agent. For example

d2 <- getURL(url, .opts=list(useragent="Mozila 5.0"))

seems to work.
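The same idea can be sketched with httr, which the question also loads. Here fetch_with_agent is a hypothetical helper name and "Mozilla/5.0" is an arbitrary user-agent string chosen for illustration; which strings the server accepts is not something this answer verified.

```r
# Hypothetical helper: fetch a page while sending an explicit User-Agent header.
# Requires the httr package when called; calls are namespace-qualified so that
# defining the function has no side effects.
fetch_with_agent <- function(url, agent = "Mozilla/5.0") {
  resp <- httr::GET(url, httr::user_agent(agent))
  httr::stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx responses
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Example call (needs network access, so it is left commented out):
# html <- fetch_with_agent("https://www.vanguardcanada.ca/individual/etfs/etfs.htm")
```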

But getting back to the main problem: when working on problems like this, I strongly recommend you use the Chrome Developer Tools (or the equivalent in your favorite browser). In the Chrome Developer Tools, specifically on the Network tab, you can see all the requests Chrome makes to get the data.

If you click on the first one ("etfs.htm") you can see the headers and response for that request. On the Response sub-tab, you should see exactly the same content that GET or getURL returns. After that, the browser downloads a bunch of CSS and JavaScript files. The file that looked most interesting was "GetETFJson.js". It seems to hold most of the data in an almost-JSON format; there is some real JavaScript in front of the JSON block that gets in the way. But we can download that file with

d3 <- GET("https://www.vanguardcanada.ca/individual/mvc/GetETFJson.js")

and extract the content as text with

p3 <- content(d3, as="text")

and then turn it into an R object with

library(jsonlite)
r3 <- fromJSON(substr(p3,13,nchar(p3)))

Note that we use substr above to strip off the non-JSON stuff at the beginning to make it easier to parse.
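The fixed offset of 13 breaks if the JavaScript prefix ever changes length. A slightly more defensive sketch locates the JSON object by its outermost braces instead. The payload below (p3_example) is a made-up stand-in, and its prefix and field values are assumptions; only the overall shape follows the file described above.

```r
# Hypothetical payload: a JavaScript assignment followed by a JSON object.
# The real file's prefix and contents will differ.
p3_example <- 'var data = {"fundData":{"Fund":[{"portId":"9548"},{"portId":"9561"}]}};'

# Position of the first "{" and of the last "}" in the text.
json_start <- as.integer(regexpr("{", p3_example, fixed = TRUE))
json_end   <- max(gregexpr("}", p3_example, fixed = TRUE)[[1]])

# Keep only the text between the outermost braces.
json_text <- substr(p3_example, json_start, json_end)

# Parse only if jsonlite is installed.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  parsed <- jsonlite::fromJSON(json_text)
  print(parsed$fundData$Fund$portId)
}
```

This assumes the file contains a single top-level JSON object and no stray braces before or after it; if that does not hold, the fixed offset or a stricter pattern may be the safer choice.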

Now you can explore the returned object. It looks like the data you want is stored in the following vectors:

cbind(r3$fundData$Fund$profile$portId, r3$fundData$Fund$profile$benchMark)

      [,1]   [,2]                                                                            
 [1,] "9548" "FTSE All World ex Canada Index in CAD"                                         
 [2,] "9561" "FTSE Canada All Cap Index in CAD"                                              
 [3,] "9554" "Spliced Canada Index"                                                          
 [4,] "9559" "FTSE Canada All Cap Real Estate Capped 25% Index"                              
 [5,] "9560" "FTSE Canada High Dividend Yield Index"                                         
 [6,] "9550" "FTSE Developed Asia Pacific Index in CAD"                                      
 [7,] "9549" "FTSE Developed Europe Index in CAD"                                            
 [8,] "9558" "FTSE Developed ex North America Index in CAD"                                  
 [9,] "9555" "Spliced FTSE Developed ex North America Index Hedged in CAD"                   
[10,] "9556" "Spliced Emerging Markets Index in CAD"                                         
[11,] "9563" "S&P 500 Index in CAD"                                                          
[12,] "9562" "S&P 500 Index in CAD Hedged"                                                   
[13,] "9566" "NASDAQ US Dividend Achievers Select Index in CAD"                              
[14,] "9564" "NASDAQ US Dividend Achievers Select Index Hedged in CAD"                       
[15,] "9557" "CRSP US Total Market Index in CAD"                                             
[16,] "9551" "Spliced US Total Market Index Hedged in CAD"                                   
[17,] "9552" "Barclays Global Aggregate CAD Float Adjusted Index in CAD"                     
[18,] "9553" "Barclays Global Aggregate CAD 1-5 Year Govt/Credit Float Adj Ix in CAD"        
[19,] "9565" "Barclays Global Aggregate Canadian 1-5 Year Credit Float Adjusted Index in CAD"
[20,] "9568" "Barclays Global Aggregate ex-USD Float Adjusted RIC Capped Index Hedged in CAD"
[21,] "9567" "Barclays U.S. Aggregate Float Adjusted Index Hedged in CAD"  

Hopefully that is sufficient to extract the data you need to identify the paths to the URLs with more data.
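As a rough sketch of that last step, the portId values can be pasted onto the detail-page URL from the question. The assumption here is that every fund follows the same URL pattern as the single example link in the question, which was not checked for each fund.

```r
# A few portId values copied from the output above.
port_ids <- c("9548", "9561", "9554")

# URL pattern taken from the question; assumed to hold for every fund.
base_url <- "https://www.vanguardcanada.ca/individual/etfs/etfs-detail-overview.htm"

# One detail-page URL per fund.
detail_urls <- paste0(base_url, "?portId=", port_ids)

print(detail_urls)
```

With the parsed object from above, port_ids could instead be taken from r3$fundData$Fund$profile$portId.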
