How to automatically retrieve the URLs of AJAX calls?


Question

The aim is to program a crawl spider able to:

1) Retrieve the URLs of the links in the table on this page: http://cordis.europa.eu/fp7/security/projects_en.html

2) Follow the AJAX call from each of those URLs to find the final ("AJAX") URLs containing the data I want to scrape.

3) Scrape the final pages identified by the AJAX URLs.

So far, I have written two spiders in Scrapy:

1) The first one retrieves the URLs from the links on the start page. Here is the code:

    from scrapy.spider import Spider
    from scrapy.selector import HtmlXPathSelector
    from cordis.items import CordisItem

    class MySpider(Spider):
        name = "Cordis1"
        allowed_domains = ["cordis.europa.eu"]
        start_urls = ["http://cordis.europa.eu/fp7/security/projects_en.html"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # yield one item per project link found in the page's table
            for href in hxs.select("//ul/li/span/a/@href").extract():
                item = CordisItem()
                item["link"] = href
                yield item
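
For the snippet above to run, CordisItem must be declared in the project's items.py. The question does not show that file, so this is a minimal assumed definition with the one field the spider fills in:

    # cordis/items.py (assumed; not shown in the question)
    from scrapy.item import Item, Field

    class CordisItem(Item):
        link = Field()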

2) The second one scrapes the data off the "AJAX" URLs. Here is the code:

from scrapy.spider import Spider
from scrapy.selector import Selector

class EssaiSpider(Spider):
    name = "aze"
    allowed_domains = ["cordis.europa.eu"]
    start_urls = ["http://cordis.europa.eu/projects/index.cfm?fuseaction=app.csa&action=read&xslt-template=projects/xsl/projectdet_en.xslt&rcn=95607",
                  "http://cordis.europa.eu/projects/index.cfm?fuseaction=app.csa&action=read&xslt-template=projects/xsl/projectdet_en.xslt&rcn=93528"]

    def parse(self, response):
        sel = Selector(response)
        acronym = sel.xpath("//*[@class='projttl']/h1/text()").extract()
        short_desc = sel.xpath("//*[@class='projttl']/h2/text()").extract()
        start = sel.xpath("//*[@class='projdates']/b[1]/following::text()[1]").extract()
        end = sel.xpath("//*[@class='projdates']/b[2]/following::text()[1]").extract()
        long_desc = sel.xpath("//*[@class='tech']/p/text()").extract()
        cost = sel.xpath("//*[@class='box-left']/b[3]/following::text()[1]").extract()
        contrib = sel.xpath("//*[@class='box-left']/b[4]/following::text()[1]").extract()
        type = sel.xpath("//*[@class='box-right']/p[3]/br/following::text()[1]").extract()
        sujet = sel.xpath("//*[@id='subjects']/h2/following::text()[1]").extract()
        coord = sel.xpath("//*[@class='projcoord']/div[1]/div[1]/text()").extract()
        coord_nat = sel.xpath("//*[@class='projcoord']/div[1]/div[2]/text()").extract()
        # name and nationality of every participant (#part1 ... #part40),
        # replacing forty near-identical copy-pasted blocks with one loop
        participants = []
        for i in range(1, 41):
            name = sel.xpath("//*[@id='part%d']/div[1]/div[1]/text()" % i).extract()
            nat = sel.xpath("//*[@id='part%d']/div[1]/div[2]/text()" % i).extract()
            participants.append((name, nat))
        print acronym, short_desc, start, end, long_desc, cost, contrib, \
              type, sujet, coord, coord_nat, participants

I could manually retrieve what, for lack of a better term, I have called the "AJAX" URLs by filtering the XHR requests with Netbug for each of the URLs yielded by the first spider. Then I would just have to feed those "AJAX" URLs to the second spider.

But is it possible to automatically retrieve those "AJAX" URLs?

More generally, how do I write a single crawl spider performing all three operations described above?

Answer

Yes, it is possible to automatically retrieve those URLs, but you have to figure out the URL from which the AJAX call loads its content. Here is a short tutorial.

1. Do your research

In the Chrome developer tools, if you open the Network tab and filter by XHR requests, you get an 'Initiator' field. On the right you have the JavaScript files that contain the code responsible for generating the requests, and Chrome shows the line from which each request is made.

In your case the most important piece of code is in the file jquery-projects.js, line 415; the line says something like this:

    $.ajax({
        async:      true,
        type:       'GET',
        url:        URL,

As you can see, there is a URL variable here. You need to find where it is defined, just a couple of lines above:

    var URL = '/projects/index.cfm?fuseaction=app.csa'; // production

    switch(type) {
        ...
        case 'doc':
            URL += '&action=read&xslt-template=projects/xsl/projectdet_' + I18n.locale + '.xslt&rcn=' + me.ref;
            break;
    }

So the URL is built from a base URL, a string starting with "action", and two variables, I18n.locale and me.ref. Keep in mind that this URL is relative, so you also need the URL root.
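
To make the construction concrete, here is a minimal Python sketch assembling the same URL the JavaScript does. The rcn value 95607 is taken from the question's example URLs; everything else comes from the snippets above:

    from urlparse import urljoin  # Python 2, as in the spiders in this post

    root = "http://cordis.europa.eu"                 # site root (the URL above is relative)
    base = "/projects/index.cfm?fuseaction=app.csa"  # hardcoded base from jquery-projects.js
    locale = "_en"   # value of I18n.locale
    ref = "95607"    # value of me.ref, read from the hidden REF form field

    ajax_url = urljoin(root, base) + \
        "&action=read&xslt-template=projects/xsl/projectdet" + locale + ".xslt&rcn=" + ref
    # -> http://cordis.europa.eu/projects/index.cfm?fuseaction=app.csa
    #    &action=read&xslt-template=projects/xsl/projectdet_en.xslt&rcn=95607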

I18n.locale turns out to be just the string "_en", but where does me.ref come from?

Again, Ctrl+F in the Sources tab of the developer tools and you find this line of jQuery:

    // record reference
    me.ref = $("#PrjSrch>input[name='REF']").val();

It turns out there is a hidden form on each of those pages, and every time a request is generated it takes me.ref from the value of this REF field.
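
In Scrapy, the equivalent of that jQuery selector is a single XPath expression. A quick sketch (the full spider below uses exactly this expression):

    from scrapy.selector import Selector

    # XPath analogue of $("#PrjSrch>input[name='REF']").val()
    def extract_ref(response):
        sel = Selector(response)
        return "".join(sel.xpath(
            '//form[@id="PrjSrch"]//input[@name="REF"]/@value').extract())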

Now you only need to apply this knowledge to your Scrapy project.

2. Use your knowledge in your Scrapy spider

At this point you know what you have to do: start from the start URL for all projects, get all the project links, make requests for those links, extract the AJAX reference from the content received for each request, and generate requests for the AJAX URLs built from it.

from scrapy.selector import Selector
from scrapy.spider import Spider
from scrapy.http import Request
from eu.items import EuItem
from urlparse import urljoin


class CordisSpider(Spider):
    name = 'cordis'
    start_urls = ['http://cordis.europa.eu/fp7/security/projects_en.html']
    base_url = "http://cordis.europa.eu/projects/"
    # template string for ajax request based on what we know from investigating webpage
    base_ajax_url = "http://cordis.europa.eu/projects/index.cfm?fuseaction=app.csa&action=read&xslt-template=projects/xsl/projectdet_en.xslt&rcn=%s"

    def parse(self, response):
        """
        Extract project links from start_url, for each generate GET request,
        and then assign a function self.get_ajax_content to handle response.
        """
        hxs = Selector(response)
        links = hxs.xpath("//ul/li/span/a/@href").extract()
        for link in links:
            link = urljoin(self.base_url,link)
            yield Request(url=link,callback=self.get_ajax_content)

    def get_ajax_content(self,response):
        """
        Extract AJAX link and make a GET request
        for the desired content, assign callback
        to handle response from this request.
        """
        hxs = Selector(response)
        # xpath analogy of jquery line we've seen
        ajax_ref = hxs.xpath('//form[@id="PrjSrch"]//input[@name="REF"]/@value').extract()
        ajax_ref = "".join(ajax_ref)
        ajax_url = self.base_ajax_url % (ajax_ref,)
        yield Request(url=ajax_url,callback=self.parse_items)

    def parse_items(self,response):
        """
        Response here should contain content
        normally loaded asynchronously with AJAX.
        """
        xhs = Selector(response)
        # you can do your processing here
        title = xhs.xpath("//div[@class='projttl']//text()").extract()
        i = EuItem()
        i["title"] = title
        return i  
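
As with the question's first spider, this assumes an item class in the project's items.py; the answer only shows the import, so a minimal assumed definition would be:

    # eu/items.py (assumed; the answer only shows the import)
    from scrapy.item import Item, Field

    class EuItem(Item):
        title = Field()

You can then run the spider as usual, for example with `scrapy crawl cordis -o items.json` to export the scraped items.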
