Why does my CrawlerProcess not have the function "crawl"?
Question
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import BackpageItem, CityvibeItem
from scrapy.shell import inspect_response
import re
import time
import sys

class MySpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com']

    # Set last_page to decide how many pages are crawled
    last_page = 10
    start_urls = ['http://www.example.com/washington/?page=%s' % page for page in xrange(1, last_page)]

    rules = (
        # Follow all links inside <div class="cat"> and call parse_item on each link
        Rule(LinkExtractor(
            restrict_xpaths=('//a[@name="listing_link"]')),
            callback='parse_item'),
    )

    # Extract relevant text from the website into an ExampleItem
    def parse_item(self, response):
        item = ExampleItem()
        item['title'] = response.xpath('string(//h2[@class="post-title"]/text())').extract()
        item['desc'] = response.xpath('string(//div[@class="section post-body"]/text())').extract()
        item['url'] = response.url
        item['location'] = response.xpath('string(//div[@class="posting"]/div[2]/text())').extract()
        item['posted_date'] = response.xpath('string(//div[@class="post-date"]/span/text())').extract()  #.re("(?<=Posted\s*).*")
        item['crawled_date'] = time.strftime("%c")
        # not sure how to get the other image urls right now
        item['image_urls'] = response.xpath('string(//div[@class="section post-contact-container"]/div/div/img/@src)').extract()
        # I can't find this section on any pages right now
        item['other_ad_urls'] = response.xpath('//a[@name="listing_link"]/@href').extract()
        item['phone_number'] = "".join(response.xpath('//div[@class="post-info"]/span[contains(text(), "Phone")]/following-sibling::a/text()').extract())
        item['email'] = "".join(response.xpath('//div[@class="post-info"]/span[contains(text(), "Email")]/following-sibling::a/text()').extract())
        item['website'] = "".join(response.xpath('//div[@class="post-info limit"]/span[contains(text(), "Website")]/following-sibling::a/text()').extract())
        item['name'] = response.xpath('//div[@class="post-name"]/text()').extract()
        # uncomment for debugging
        #inspect_response(response, self)
        return item

# process1 = CrawlerProcess({
#     'ITEM_PIPELINES': {
#         #'scrapy.contrib.pipeline.images.ImagesPipeline': 1
#         'backpage.pipelines.GeolocationPipeline': 4,
#         'backpage.pipelines.LocationExtractionPipeline': 3,
#         'backpage.pipelines.BackpagePipeline': 5
#     }
# })

process1 = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process1.crawl(MySpider)
process1.start()
My spider works perfectly when I run it from the command line with
scrapy crawl example
but I will need to run multiple spiders, so I want to put them all in one script and use CrawlerProcess. When I try to run this I get the error
AttributeError: 'CrawlerProcess' object has no attribute 'crawl'
This is Scrapy version 0.24.6. All items and pipelines are correct, because the spider works from the command line.
Accepted Answer
There is (was?) a compatibility problem between Scrapy and Scrapyd. I needed to run Scrapy 0.24 with Scrapyd 1.0.1. Here is the issue on GitHub: https://github.com/scrapy/scrapyd/issues/100#issuecomment-115268880