Web scraping with R, XML package - paths in the web browser are different from the parsed HTML downloaded in R


Problem description


    I am web scraping this website (in Portuguese).

    When you use Google Chrome, the XPath expression //div[@class='result-ofertas']//span[@class='location']/a[1] correctly returns the neighborhood of the apartments for sale. You can try this yourself with Chrome's XPath Helper extension.

    OK. So I tried to download the page with R to automate the data extraction, using the XML package:

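    library(XML)
    site <- "http://www.zap.com.br/imoveis/sao-paulo+sao-paulo/apartamento-padrao/aluguel/?rn=104123456&pag=1"
    # Parse the page straight from the URL into an internal document so XPath queries can run on it
    html.raw <- htmlTreeParse(site, useInternalNodes = TRUE, encoding = "UTF-8")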
    

    But when I download the website in R, the page source is not the same anymore.

    The previous xpath command results in null:

    xpathApply(html.raw, "//div[@class='result-ofertas']//span[@class='location']/a[1]", xmlValue)
    

    But if you manually download the page to your computer instead of downloading it with R, the XPath above works just fine.

    It seems that R is downloading a different webpage (a "mobile" version; it gets this one instead of the correct one), not the one shown in Chrome.

    My problem is not with how to extract the information of this "different" page that R is downloading. I can actually deal with that with the xpath command below:

    xpathApply(html.raw, "//p[@class='local']", xmlValue)
    

    But I really would like to understand why and how this is happening.

    More specifically:

    1. What is happening here?
    2. Why are there two different webpages (Chrome's and R's), even though the address is the same?
    3. Is there a way to force R to download the exact webpage I see in Chrome? (This would be useful, because I usually test XPath expressions with the XPath Helper extension.)

    Solution

    The site is most likely redirecting requests based on the user agent. Try setting the request user agent in R to match your Chrome user agent, which you can see on the Network tab of the developer tools: select a request and view its headers.
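
    As a rough sketch of that suggestion, the request could be made with the RCurl package (an assumption; the question only uses XML) so that a browser-like User-Agent header is sent, and the returned text then parsed with XML. The helper names `fetch_html` and `extract_text` are hypothetical, and the user agent string is only a placeholder: copy the real one from your own browser. Whether the site then serves the desktop page is not guaranteed.

```r
library(RCurl)  # HTTP client; provides getURL()
library(XML)    # HTML parsing and XPath; provides htmlParse(), xpathSApply()

# Hypothetical helper: download `url`, sending `user_agent` as the
# User-Agent header and following any HTTP redirects.
fetch_html <- function(url, user_agent) {
  getURL(url,
         useragent = user_agent,   # libcurl option that sets the User-Agent header
         followlocation = TRUE,    # follow redirects instead of stopping at them
         .encoding = "UTF-8")
}

# Hypothetical helper: parse HTML text and return the text content
# of every node matched by `xpath`.
extract_text <- function(html_text, xpath) {
  doc <- htmlParse(html_text, asText = TRUE, encoding = "UTF-8")
  xpathSApply(doc, xpath, xmlValue)
}

# Example usage (needs network access, so it only runs in an interactive session).
if (interactive()) {
  # Placeholder value: replace with the User-Agent shown in Chrome's developer tools.
  chrome_ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
  site <- "http://www.zap.com.br/imoveis/sao-paulo+sao-paulo/apartamento-padrao/aluguel/?rn=104123456&pag=1"
  page <- fetch_html(site, chrome_ua)
  print(extract_text(page, "//div[@class='result-ofertas']//span[@class='location']/a[1]"))
}
```

    Separating the download from the parsing makes it easy to save the fetched text and compare it with the page source shown in Chrome, which helps confirm whether the user agent is really what triggers the different page.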
