selenium ajax dynamic pagination base spider


Problem Description

I am trying to run my base spider on a site with dynamic pagination, but the crawl is not succeeding. I am using Selenium to handle the AJAX-driven pagination. The URL I am using is: http://www.demo.com. Here is my code:

# -*- coding: utf-8 -*-
import scrapy
import re

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from demo.items import demoItem
from selenium import webdriver

def removeUnicodes(strData):
        if(strData):
            #strData = strData.decode('unicode_escape').encode('ascii','ignore')
            strData = strData.encode('utf-8').strip()
            strData = re.sub(r'[\n\r\t]', r' ', strData.strip())
            #print 'Output:', strData
        return strData


class demoSpider(scrapy.Spider):
    name = "demourls"

    allowed_domains = ["demo.com"]

    start_urls = ['http://www.demo.com']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        print "*****************************************************"
        self.driver.get(response.url)
        print response.url
        print "______________________________"



        hxs = Selector(response)
        item = demoItem()
        finalurls = []
        while True:
            next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')

            try:
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] =  removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
                print '**********************************************2***url*****************************************',urls


                for url in urls:
                    print '---------url-------',url
                finalurls.append(url)          

                item['urls'] = finalurls

            except:
                break

        self.driver.close()

        return item

My items.py is:

from scrapy.item import Item, Field

class demoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()

When I crawl it and export the output to JSON, I get my JSON file as:

[{"pageurl": "http://www.demo.com", "urls": [], "title": "demo"}]

I am not able to crawl all the URLs because they are loaded dynamically.
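
A likely reason for the empty urls list: the xpath runs against the static Scrapy response, which never contains the content Selenium loads after the clicks. A minimal sketch of the fix (an assumption based on the page structure above, and the approach the answer below takes) is to extract through the driver's rendered DOM instead:

# Hedged sketch, not the asker's code: read the links from Selenium's
# rendered DOM instead of the static Scrapy response.
urls = [a.get_attribute("href")
        for a in self.driver.find_elements_by_xpath('//a[@id="linkToDetails"]')]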

Recommended Answer

I hope the code below will help. The key change is that the links are extracted through the Selenium driver itself, which sees the AJAX-loaded DOM, rather than through the static Scrapy response.

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from demo.items import DemoItem
from selenium import webdriver

def removeUnicodes(strData):
        if(strData):
            strData = strData.encode('utf-8').strip() 
            strData = re.sub(r'[\n\r\t]',r' ',strData.strip())
        return strData

class demoSpider(scrapy.Spider):
    name = "domainurls"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/used/cars-in-trichy/']

    def __init__(self):
        self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub", webdriver.DesiredCapabilities.HTMLUNITWITHJS)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(5)
        hxs = Selector(response)
        item = DemoItem()
        finalurls = []
        while True:
            next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')

            try:
                next.click()
                # get the data and write it to the scrapy item
                item['pageurl'] = response.url
                item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                # extract the links via the Selenium driver so the
                # AJAX-loaded DOM is visible
                urls = self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')

                for url in urls:
                    url = url.get_attribute("href")
                    finalurls.append(removeUnicodes(url))          

                item['urls'] = finalurls

            except:
                break

        self.driver.close()
        return item
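
Two hardening tweaks are worth considering. First, implicitly_wait(5) may not be enough if the "show more" link renders slowly, in which case the first find_element_by_xpath raises and the loop exits early; an explicit wait is more deterministic. Second, the title xpath still reads the static Scrapy response; a Selector built from the rendered page source sees the same DOM the browser does. A minimal sketch of both, assuming the page structure above and an assumed 10-second timeout:

from scrapy.selector import Selector
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait (up to an assumed 10 s) until the "show more" link is clickable,
# instead of relying on the implicit wait alone
next_link = WebDriverWait(self.driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//div[@class="showMoreCars hide"]/a'))
)
next_link.click()

# build the Selector from the browser's rendered DOM rather than the
# static response, so the title xpath also sees AJAX-loaded content
hxs = Selector(text=self.driver.page_source)
item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])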

items.py

from scrapy.item import Item, Field

class DemoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()

Note: You need to have the Selenium RC server running, because HTMLUNITWITHJS works only with Selenium RC when using Python.

To run your Selenium RC server, issue the command:

java -jar selenium-server-standalone-2.44.0.jar

To run your spider, use the command:

scrapy crawl domainurls -o someoutput.json
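
As an aside, if running a separate server is inconvenient, one era-appropriate alternative (an assumption, not part of the original answer) is a local headless driver, which needs no RC server:

from selenium import webdriver

# local headless browser; no Selenium RC server required
# (assumes the phantomjs binary is installed and on PATH)
driver = webdriver.PhantomJS()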
