Scraping JavaScript-generated data

Question

I'm working on a project with the World Bank, analyzing their procurement processes.

The World Bank (WB) maintains a website for each of its projects, containing links and data for the associated contracts issued (example). Contract-related data is available under the procurement tab.

I'd like to be able to pull a project's contract information from this site, but the links and associated data are generated using embedded JavaScript, and the URLs of the pages displaying contract awards and other data don't seem to follow a discernible schema (example).

Is there any way I can scrape the browser-rendered data in the first example through R?

Recommended answer

The main page calls a JavaScript function:

javascript:callTabContent('p','P090644','','en','procurement','procurementId');

The main thing here is the project ID, P090644. Together with the required language, en, it is passed as a parameter to a form at http://www.worldbank.org/p2e/procurement.html.

This form call can be replicated using the URL http://www.worldbank.org/p2e/procurement.html?lang=en&projId=P090644.
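
If the project ID has to be recovered from the link itself rather than typed in, the quoted arguments can be pulled out of the javascript: string with a regular expression. The following is only a sketch in base R: parse_tab_call and the labels it assigns are hypothetical names chosen for illustration, and it assumes that the fifth argument names the page under /p2e/, as 'procurement' does in the call above.

```r
# Hypothetical helper: extract the quoted arguments from a callTabContent link.
parse_tab_call <- function(href) {
  # Match every single-quoted argument, including empty ones ('')
  quoted <- regmatches(href, gregexpr("'[^']*'", href))[[1]]
  parts <- gsub("'", "", quoted, fixed = TRUE)
  stopifnot(length(parts) == 6)
  # Labels are illustrative; they follow the argument order seen in the link
  names(parts) <- c("tab", "projId", "contextPath", "lang", "htmlId", "anchorTagId")
  parts
}

tab_call <- "javascript:callTabContent('p','P090644','','en','procurement','procurementId');"
parts <- parse_tab_call(tab_call)

# Rebuild the GET URL from the parsed pieces (host taken from the URL above)
url <- paste0("http://www.worldbank.org/p2e/", parts[["htmlId"]], ".html",
              "?lang=", parts[["lang"]], "&projId=", parts[["projId"]])
url
```

Keeping the parsed arguments in a named vector makes it easy to apply the same function to the links of several project pages.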

Code to extract the relevant project description URLs follows:

projID <- "P090644"
projDetails <- paste0("http://www.worldbank.org/p2e/procurement.html?lang=en&projId=", projID)

library(XML)

# Parse the procurement page and collect every href inside the contract awards table
pdData <- htmlParse(projDetails)
pdDescriptions <- xpathSApply(pdData, '//*/table[@id="contractawards"]//*/@href')

# > pdDescriptions
#                                                                 href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005718"
#                                                                 href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005702"
#                                                                 href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005709"
#                                                                 href
# "http://search.worldbank.org/wcontractawards/procdetails/OP00005715"
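
The result of xpathSApply is a character vector in which every element is named href, which is awkward to work with further. As a small sketch in base R, the four links printed above can be tidied into a data frame. Treating the last path segment as an identifier is an assumption made here for illustration; the column names are likewise hypothetical.

```r
# The four links from the output above, as a plain character vector.
# With a live result, unname() on the xpathSApply output would give the same shape.
links <- c(
  "http://search.worldbank.org/wcontractawards/procdetails/OP00005718",
  "http://search.worldbank.org/wcontractawards/procdetails/OP00005702",
  "http://search.worldbank.org/wcontractawards/procdetails/OP00005709",
  "http://search.worldbank.org/wcontractawards/procdetails/OP00005715"
)

contracts <- data.frame(
  id  = basename(links),  # last path segment, assumed to identify the record
  url = links,
  stringsAsFactors = FALSE
)

# Drop repeated links in case the table references the same page twice
contracts <- contracts[!duplicated(contracts$url), ]
contracts
```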

Note that Excel links are also provided, which may be of use to you. They may contain the data you intend to scrape from the description links:

# Excel exports for procurement notices, contract awards and contract data
procNotice <- paste0("http://search.worldbank.org/wprocnotices/projectdetails/", projID, ".xls")
conAward <- paste0("http://search.worldbank.org/wcontractawards/projectdetails/", projID, ".xls")
conData <- paste0("http://search.worldbank.org/wcontractdata/projectdetails/", projID, ".xls")

library(gdata)

# read.xls accepts a URL and returns a data frame
pnData <- read.xls(procNotice)
caData <- read.xls(conAward)
cdData <- read.xls(conData)

UPDATE:

To find out what is being posted, we can examine what happens when the JavaScript function is called. Using Firebug or a similar tool, we can intercept the request, whose header starts:

POST /p2e/procurement.html HTTP/1.1
Host: www.worldbank.org

and has the parameters:

lang=en
projId=P090644

Alternatively, we can examine the JavaScript at http://siteresources.worldbank.org/cached/extapps/cver116/p2e/js/script.js and look at the function callTabContent:

function callTabContent(tabparam, projIdParam, contextPath, langCd, htmlId, anchorTagId) {
    if (tabparam == 'n' || tabparam == 'h') {
        $.ajax( {
            type : "POST",
            url : contextPath + "/p2e/"+htmlId+".html",
            data : "projId=" + projIdParam + "&lang=" + langCd,
            success : function(msg) {
                if(tabparam=="n"){
                    $("#newsfeed").replaceWith(msg);
                } else{
                    $("#cycle").replaceWith(msg);
                }
                stickNotes();
            }
        });
    } else {
        $.ajax( {
            type : "POST",
            url : contextPath + "/p2e/"+htmlId+".html",
            data : "projId=" + projIdParam + "&lang=" + langCd,
            success : function(msg) {
                $("#tabContent").replaceWith(msg);
                $('#map_container').hide();
                changeAlternateColors();
                $("#tab_menu a").removeClass("selected");
                $('#'+anchorTagId).addClass("selected");                
                stickNotes();
            }
        });
    }
}

Examining the content of the function, we can see that it simply posts the relevant parameters to a form and then updates the web page.
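
The same POST can be imitated from R. The sketch below mirrors the URL pattern and the two form fields used by callTabContent. The helper names build_tab_request and fetch_tab are hypothetical, and the default contextPath is an assumption: the page passes an empty string, so the host from the intercepted request header is substituted here. Sending the request relies on the httr package and network access, and this sketch cannot confirm that the endpoint still responds.

```r
# Hypothetical helper mirroring callTabContent: same URL pattern, same form fields.
build_tab_request <- function(projId, lang = "en", htmlId = "procurement",
                              contextPath = "http://www.worldbank.org") {
  list(
    url  = paste0(contextPath, "/p2e/", htmlId, ".html"),
    body = list(projId = projId, lang = lang)
  )
}

req <- build_tab_request("P090644")
req$url

# Sending the request needs httr and a network connection, so it is wrapped
# in a function instead of being run at the top level.
fetch_tab <- function(req) {
  resp <- httr::POST(req$url, body = req$body, encode = "form")
  httr::stop_for_status(resp)
  # Returns the HTML fragment as text, ready to be parsed
  httr::content(resp, as = "text", encoding = "UTF-8")
}
```

Calling fetch_tab(req) would return the HTML fragment that the page inserts into the tab, which could then be parsed in the same way as the GET response.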
