scrapy crawl spider ajax pagination


Problem description

I was trying to scrape a link which uses an AJAX call for pagination. I am trying to crawl the http://www.demo.com link, and in the .py file I provided this code (with restrict_xpaths):

import scrapy

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from sum.items import sumItem


class Sumspider1(CrawlSpider):
    name = 'sumDetailsUrls'
    allowed_domains = ['sum.com']
    start_urls = ['http://www.demo.com']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//ul[@id="pager"]/li[8]/a'),
             callback='parse_start_url', follow=True),
    )

    # Override parse_start_url so the spider also scrapes the first page.
    def parse_start_url(self, response):
        # //div[@class="showMoreCars hide"]/a
        # .//ul[@id="pager"]/li[8]/a/@href
        self.log('Inside - parse_item %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = sumItem()
        item['page'] = response.url

        title = hxs.xpath('.//h1[@class="page-heading"]/text()').extract()
        print 'title:', title

        urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
        print 'urls:', urls

        finalurls = []
        for url in urls:
            print 'url:', url
            finalurls.append(url)  # append inside the loop, once per URL

        item['urls'] = finalurls
        return item

My items.py file contains:

from scrapy.item import Item, Field


class sumItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page = Field()
    urls = Field()

Still I'm not getting the expected output: I am not able to fetch all the pages when I crawl.
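A quick way to check the likely cause (this is only a diagnostic sketch; that the pager is AJAX-injected on this site is an assumption) is to see whether the pager link exists at all in the raw HTML Scrapy downloads, since LinkExtractor can only extract links from that HTML, never ones added later by JavaScript:

# Diagnostic sketch: in `scrapy shell http://www.demo.com`, try the
# pager XPath from the Rule against the raw, non-JavaScript response.
response.xpath('.//ul[@id="pager"]/li[8]/a').extract()
# An empty list means the pagination link is injected by AJAX, so the
# CrawlSpider Rule never gets a link to follow.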

Recommended answer

I hope the code below will help.

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.selector import Selector
from demo.items import DemoItem
from selenium import webdriver


def removeUnicodes(strData):
    # Encode to UTF-8 and collapse newlines/tabs into spaces.
    if strData:
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]', r' ', strData.strip())
    return strData


class demoSpider(scrapy.Spider):
    name = "domainurls"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/used/cars-in-trichy/']

    def __init__(self):
        # Drive a JavaScript-capable headless browser (HtmlUnit) through
        # a Selenium RC server listening on localhost:4444.
        self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub",
                                       webdriver.DesiredCapabilities.HTMLUNITWITHJS)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(5)
        hxs = Selector(response)  # static response downloaded by Scrapy
        item = DemoItem()
        finalurls = []

        # Keep clicking the "show more" button until it disappears,
        # collecting the detail-page links rendered by JavaScript.
        while True:
            try:
                next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                urls = self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')

                for url in urls:
                    url = url.get_attribute("href")
                    finalurls.append(removeUnicodes(url))

                item['urls'] = finalurls

            except:
                # No more "show more" button: pagination is exhausted.
                break

        self.driver.close()
        return item

items.py

from scrapy.item import Item, Field

class DemoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()

Note: You need to have a Selenium RC server running, because HTMLUNITWITHJS works with Selenium RC only when using Python.

Start your Selenium RC server with the command:

java -jar selenium-server-standalone-2.44.0.jar
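Before launching the spider, it can help to check the hub by hand. The following is a minimal sketch (it assumes the standalone server above is already listening on port 4444) that uses the same webdriver.Remote call as the spider:

# Sanity-check sketch: open an HtmlUnit session against the RC hub.
from selenium import webdriver

driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub",
                          webdriver.DesiredCapabilities.HTMLUNITWITHJS)
driver.get('http://www.domain.com/used/cars-in-trichy/')
print driver.current_url  # prints the page URL if the hub and browser respond
driver.quit()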

Run your spider with the command:

scrapy crawl domainurls -o someoutput.json
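If the crawl succeeds, someoutput.json will contain the single item the spider returns, with the fields declared in items.py. The shape below is hypothetical (placeholder values, not real output):

# Hypothetical shape of someoutput.json (placeholder values):
# [{"pageurl": "http://www.domain.com/used/cars-in-trichy/",
#   "title": "<page heading>",
#   "urls": ["<detail link 1>", "<detail link 2>", "..."]}]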
