How to automatically retrieve the URLs of AJAX calls?

Question

The aim is to write a crawl spider able to:

1) Retrieve the URLs of the links in the table on this page: http://cordis.europa.eu/fp7/security/projects_en.html

2) Follow the AJAX call made from each of those URLs to find the final ("AJAX") URLs containing the data that I want to scrape

3) Scrape the final pages identified by those AJAX URLs.

So far, I have written two spiders with Scrapy:

1) The first one retrieves the URLs from the links on the start page. Here is the code:

   from scrapy.spider import Spider
   from scrapy.selector import HtmlXPathSelector
   from cordis.items import CordisItem

   class MySpider(Spider):
       name = "Cordis1"
       allowed_domains = ["cordis.europa.eu"]
       start_urls = ["http://cordis.europa.eu/fp7/security/projects_en.html"]

       def parse(self, response):
           hxs = HtmlXPathSelector(response)
           # build one item per project link found in the table on the start page
           items = []
           for href in hxs.select("//ul/li/span/a/@href").extract():
               item = CordisItem()
               item["link"] = href
               items.append(item)
           return items

2) The second one scrapes the data off the "AJAX" URLs. Here is the code:

from scrapy.spider import Spider
from scrapy.selector import Selector

class EssaiSpider(Spider):
    name = "aze"
    allowed_domains = ["cordis.europa.eu"]
    start_urls = ["http://cordis.europa.eu/projects/index.cfm?fuseaction=app.csa&action=read&xslt-template=projects/xsl/projectdet_en.xslt&rcn=95607",
    "http://cordis.europa.eu/projects/index.cfm?fuseaction=app.csa&action=read&xslt-template=projects/xsl/projectdet_en.xslt&rcn=93528"]

    def parse(self, response):
        sel = Selector(response)
        acronym = sel.xpath("//*[@class='projttl']/h1/text()").extract()
        short_desc = sel.xpath("//*[@class='projttl']/h2/text()").extract()
        start = sel.xpath("//*[@class='projdates']/b[1]/following::text()[1]").extract()
        end = sel.xpath("//*[@class='projdates']/b[2]/following::text()[1]").extract()
        long_desc = sel.xpath("//*[@class='tech']/p/text()").extract()
        cost = sel.xpath("//*[@class='box-left']/b[3]/following::text()[1]").extract()
        contrib = sel.xpath("//*[@class='box-left']/b[4]/following::text()[1]").extract()
        type = sel.xpath("//*[@class='box-right']/p[3]/br/following::text()[1]").extract()
        sujet = sel.xpath("//*[@id='subjects']/h2/following::text()[1]").extract()
        coord = sel.xpath("//*[@class='projcoord']/div[1]/div[1]/text()").extract()
        coord_nat = sel.xpath("//*[@class='projcoord']/div[1]/div[2]/text()").extract()
        # Participants sit in blocks with ids part1 ... part40; collect their
        # names and nationalities in a loop instead of repeating the same pair
        # of XPath expressions forty times.
        participants = []
        for i in range(1, 41):
            part = sel.xpath("//*[@id='part%d']/div[1]/div[1]/text()" % i).extract()
            part_nat = sel.xpath("//*[@id='part%d']/div[1]/div[2]/text()" % i).extract()
            participants.append((part, part_nat))
        print acronym, short_desc, start, end, long_desc, cost, contrib, type, sujet, coord, coord_nat, participants

I could manually retrieve what, for lack of a better term, I have called the "AJAX" URLs by filtering the XHR requests with Netbug for each of the URLs yielded by the first spider. Then I would just have to feed those "AJAX" URLs to the second spider.

But is it possible to retrieve those "AJAX" URLs automatically?

More generally, how can I write a single crawl spider that performs all three operations described above?

Answer

Yes, it is possible to retrieve those URLs automatically, but you have to figure out the URL from which the AJAX call loads its content. Here is a simple tutorial.

1. Do your research

In the Chrome console, if you open the Network tab and filter by XHR requests, you get an 'Initiator' field. On the right you have the JavaScript files containing the code responsible for generating the requests, and the console shows the lines from which each request is made.

In your case the most important piece of code is in the file jquery-projects.js, at line 415, which says something like this:

    $.ajax({
        async:      true,
        type:       'GET',
        url:        URL,

As you can see, there is a URL variable here. You need to find where it is set, just a couple of lines above:

    var URL = '/projects/index.cfm?fuseaction=app.csa'; // production

    switch(type) {
        ...
        case 'doc':
            URL += '&action=read&xslt-template=projects/xsl/projectdet_' + I18n.locale + '.xslt&rcn=' + me.ref;
            break;
    }

So the URL is built by taking the base URL, appending a string starting with action, and then two variables, I18n.locale and me.ref. Keep in mind that this URL is relative, so you also need the URL root.
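
A minimal Python sketch of that construction (the concrete values are assumptions for illustration: '_en' for I18n.locale and an rcn value taken from the question's start_urls):

    # Sketch: rebuild the AJAX URL the way jquery-projects.js does, with the
    # site root prepended since the URL in the script is relative.
    # '_en' stands in for I18n.locale and '95607' for me.ref (an example rcn
    # taken from the question's start_urls).
    root = 'http://cordis.europa.eu'
    base = root + '/projects/index.cfm?fuseaction=app.csa'
    locale = '_en'
    ref = '95607'
    ajax_url = base + '&action=read&xslt-template=projects/xsl/projectdet' + locale + '.xslt&rcn=' + ref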

I18n.locale turns out to be just the string "_en", but where does me.ref come from?

Ctrl+F again in the console's Sources tab and you will find this line of jQuery:

    // record reference
    me.ref = $("#PrjSrch>input[name='REF']").val();

It turns out there is a hidden form on each of those pages, and every time a request is generated it takes its value from this REF field (me.ref).
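
If you want to check that value from Scrapy, a selector equivalent of that jQuery line looks roughly like this (a sketch; the spider below uses the same XPath):

    from scrapy.selector import Selector

    def extract_ref(response):
        # XPath analogue of $("#PrjSrch>input[name='REF']").val()
        sel = Selector(response)
        return "".join(sel.xpath('//form[@id="PrjSrch"]//input[@name="REF"]/@value').extract())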

Now you only need to apply this knowledge in your Scrapy project.

2. Use that knowledge in your spider

At this point you know what you have to do: start from the start URL that lists all projects, get all the links, make requests for those links, then extract the AJAX URL from the content received for each request, and generate requests for the URLs obtained there.

from scrapy.selector import Selector
from scrapy.spider import Spider
from scrapy.http import Request
from eu.items import EuItem
from urlparse import urljoin


class CordisSpider(Spider):
    name = 'cordis'
    start_urls = ['http://cordis.europa.eu/fp7/security/projects_en.html']
    base_url = "http://cordis.europa.eu/projects/"
    # template string for ajax request based on what we know from investigating webpage
    base_ajax_url = "http://cordis.europa.eu/projects/index.cfm?fuseaction=app.csa&action=read&xslt-template=projects/xsl/projectdet_en.xslt&rcn=%s"

    def parse(self, response):
        """
        Extract project links from start_url, for each generate GET request,
        and then assign a function self.get_ajax_content to handle response.
        """
        hxs = Selector(response)
        links = hxs.xpath("//ul/li/span/a/@href").extract()
        for link in links:
            link = urljoin(self.base_url,link)
            yield Request(url=link,callback=self.get_ajax_content)

    def get_ajax_content(self,response):
        """
        Extract AJAX link and make a GET request
        for the desired content, assign callback
        to handle response from this request.
        """
        hxs = Selector(response)
        # xpath analogy of jquery line we've seen
        ajax_ref = hxs.xpath('//form[@id="PrjSrch"]//input[@name="REF"]/@value').extract()
        ajax_ref = "".join(ajax_ref)
        ajax_url = self.base_ajax_url % (ajax_ref,)
        yield Request(url=ajax_url,callback=self.parse_items)

    def parse_items(self,response):
        """
        Response here should contain content
        normally loaded asynchronously with AJAX.
        """
        xhs = Selector(response)
        # you can do your processing here
        title = xhs.xpath("//div[@class='projttl']//text()").extract()
        i = EuItem()
        i["title"] = title
        return i  
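
The spider imports EuItem from eu.items, but the item class is not shown in the answer; a minimal definition (an assumption, with only the field used above) would be:

    # eu/items.py -- assumed minimal item definition (not part of the original answer)
    from scrapy.item import Item, Field

    class EuItem(Item):
        title = Field()

With that in place you can run the spider as usual, e.g. scrapy crawl cordis -o projects.json, and the exported items should contain the content that is normally loaded asynchronously.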
