Web scraping with R, XML package - paths in the web browser are different from the parsed HTML downloaded in R
Problem description
I am scraping this website (in Portuguese).
When you are using Google Chrome, the XPath expression //div[@class='result-ofertas']//span[@class='location']/a[1]
correctly returns the neighborhood of the apartments for sale. You can try this yourself with Chrome's XPath Helper extension.
So I tried to download the website with R to automate the extraction of the data, using the XML package:
library(XML)
site <- "http://www.zap.com.br/imoveis/sao-paulo+sao-paulo/apartamento-padrao/aluguel/?rn=104123456&pag=1"
html.raw <- htmlTreeParse(site, useInternalNodes=TRUE, encoding="UTF-8")
But when I download the website in R, the page source is not the same anymore.
The previous xpath command results in null:
xpathApply(html.raw, "//div[@class='result-ofertas']//span[@class='location']/a[1]", xmlValue)
But if you manually download the website to your computer instead of downloading it with R, the XPath above works just fine.
It seems that R is downloading a different webpage (a "mobile" version) rather than the one shown in Chrome.
My problem is not with how to extract the information of this "different" page that R is downloading. I can actually deal with that with the xpath command below:
xpathApply(html.raw, "//p[@class='local']", xmlValue)
But I really would like to understand why and how this is happening.
More specifically:
- What is happening here?
- Why are the two different webpages (Chrome's and R's), even though the address is the same?
- Is there a way to force R to download the exact webpage I see in Chrome? (This would be useful, because I usually test the XPath commands with the XPath Helper extension.)
Solution
The site is most likely redirecting requests based on the user agent. Try setting the request user agent in R to match your Chrome user agent, which can be seen on the Network tab of the developer tools: select a request and view its headers.
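As a rough illustration of that suggestion, the sketch below uses the RCurl package (an assumption; the question only loads XML) to request a page with a browser-like User-Agent header and then parses the result with XML. The names download_as_browser, extract_neighborhoods and chrome_ua are hypothetical, and the user agent string is only a placeholder to be replaced with the one shown in your own Chrome developer tools. Whether this site really switches pages on the user agent is not confirmed here.

```r
library(RCurl)  # assumption: RCurl is installed; it is not used in the question
library(XML)

# Placeholder user agent string; copy the real one from Chrome's developer tools
# (Network tab, select a request, look at the request headers).
chrome_ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

# Hypothetical helper: fetch a URL while sending a browser-like User-Agent,
# following redirects, and return the parsed HTML document.
download_as_browser <- function(url, ua = chrome_ua) {
  page <- getURL(url, useragent = ua, followlocation = TRUE, .encoding = "UTF-8")
  htmlParse(page, asText = TRUE, encoding = "UTF-8")
}

# Hypothetical helper: apply the XPath from the question to a parsed document.
extract_neighborhoods <- function(doc) {
  xpathSApply(doc,
              "//div[@class='result-ofertas']//span[@class='location']/a[1]",
              xmlValue)
}
```

With these defined, html.raw <- download_as_browser(site) followed by extract_neighborhoods(html.raw) would rerun the original query against the page fetched with the browser-like header. If the result is still empty, the difference may come from something other than the user agent, such as cookies or content rendered by JavaScript.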