Selenium AJAX dynamic pagination base spider
Problem Description
I am trying to run my base spider on a page with dynamic pagination, but the crawl is not succeeding. I have used Selenium for the AJAX dynamic pagination. The URL I am using is: http://www.demo.com. Here is my code:
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from demo.items import demoItem
from selenium import webdriver

def removeUnicodes(strData):
    if strData:
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]', r' ', strData.strip())
    return strData

class demoSpider(scrapy.Spider):
    name = "demourls"
    allowed_domains = ["demo.com"]
    start_urls = ['http://www.demo.com']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        print "*****************************************************"
        self.driver.get(response.url)
        print response.url
        print "______________________________"
        hxs = Selector(response)
        item = demoItem()  # was sumItem(), which is never imported or defined
        finalurls = []
        while True:
            next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
            try:
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
                print '----------urls----------', urls
                for url in urls:
                    print '---------url-------', url
                    finalurls.append(url)
                item['urls'] = finalurls
            except:
                break
        self.driver.close()
        return item
My items.py is:
from scrapy.item import Item, Field

class demoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()
When I try to crawl it and convert the output to JSON, the JSON file I get is:
[{"pageurl": "http://www.demo.com", "urls": [], "title": "demo"}]
I am not able to crawl all the URLs because they are loaded dynamically.
Recommended Answer
I hope the code below will help.
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from demo.items import DemoItem
from selenium import webdriver

def removeUnicodes(strData):
    if strData:
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]', r' ', strData.strip())
    return strData

class demoSpider(scrapy.Spider):
    name = "domainurls"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/used/cars-in-trichy/']

    def __init__(self):
        # drive a headless HtmlUnit browser through the Selenium RC server
        self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub", webdriver.DesiredCapabilities.HTMLUNITWITHJS)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(5)
        hxs = Selector(response)
        item = DemoItem()
        finalurls = []
        while True:
            next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
            try:
                # keep clicking "show more"; when the click fails, no pages are left
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                # read the detail links from the live, JavaScript-rendered DOM
                urls = self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')
                for url in urls:
                    url = url.get_attribute("href")
                    finalurls.append(removeUnicodes(url))
                item['urls'] = finalurls
            except:
                break
        self.driver.close()
        return item
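As a side note, the bare except above also swallows unrelated errors. A minimal sketch of a more defensive pagination loop (not part of the original answer; it reuses the same XPaths and assumes a Selenium 2.x Python client), waiting explicitly for the "show more" link instead of relying on the implicit wait:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def click_through_pages(driver, timeout=5):
    """Click the 'show more' link until it stops appearing, then collect links."""
    while True:
        try:
            # wait until the link is clickable; a TimeoutException means no more pages
            button = WebDriverWait(driver, timeout).until(
                EC.element_to_be_clickable(
                    (By.XPATH, '//div[@class="showMoreCars hide"]/a')))
            button.click()
        except TimeoutException:
            break
    # after all pages have loaded, collect every detail link in one pass
    return [a.get_attribute("href")
            for a in driver.find_elements_by_xpath('//a[@id="linkToDetails"]')]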
items.py
from scrapy.item import Item, Field
class DemoItem(Item):
page = Field()
urls = Field()
pageurl = Field()
title = Field()
Note: You need to have the Selenium RC server running, because HTMLUNITWITHJS works only with Selenium RC when using Python.
To start your Selenium RC server, issue this command:
java -jar selenium-server-standalone-2.44.0.jar
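Once the server is up, a quick sanity check (a sketch, assuming the default port 4444) is to query the hub's status endpoint:

# prints a JSON status blob if the Selenium server is listening on the default port
import urllib2
print urllib2.urlopen("http://127.0.0.1:4444/wd/hub/status").read()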
Then run your spider with this command:
scrapy crawl domainurls -o someoutput.json