How can I crawl a two-level website with the Scrapy framework
Question
I want to crawl a website that has two levels of URLs. The first level is a multi-page list, with URLs like this:
http://www.example.com/group/p{n}/
The page layout looks like this:
- list item link 1
- list item link 2
- list item link 3
- list item link 4
1, 2, 3, 4, 5 ... next page
The second level is a detail page, with URLs like this:
http://www.example.com/group/view/{n}/
My spider code is:
import scrapy
from scrapy.spiders.crawl import CrawlSpider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders.crawl import Rule
from urlparse import urljoin


class MyCrawler(CrawlSpider):
    name = "AnjukeCrawler"

    start_urls = [
        "http://www.example.com/group/"
    ]

    rules = [
        Rule(LxmlLinkExtractor(allow=(),
                               restrict_xpaths=(["//div[@class='multi-page']/a[@class='aNxt']"])),
             callback='parse_list_page',
             follow=True)
    ]

    def parse_list_page(self, response):
        list_page = response.xpath("//div[@class='li-itemmod']/div/h3/a/@href").extract()
        for item in list_page:
            yield scrapy.http.Request(self, url=urljoin(response.url, item),
                                      callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        community_name = response.xpath("//dl[@class='comm-l-detail float-l']/dd")[0].extract()
        self.log(community_name, 2)
My question is: my parse_detail_page never seems to run. Can anyone tell me why, and how can I fix it?
Thanks!
Recommended answer
You should never override the parse method of CrawlSpider, because it contains the core parsing logic for this type of spider. So your def parse( should have been def parse_list_page( - that typo is your issue.
However, your rule looks like overhead: it uses both a callback and follow=True just to extract links. It is better to use a list of rules and rewrite your spider like this:
class MyCrawler(CrawlSpider):
    name = "AnjukeCrawler"

    start_urls = [
        "http://www.example.com/group/"
    ]

    rules = [
        # follow the pagination links on the list pages
        Rule(LxmlLinkExtractor(restrict_xpaths="//div[@class='multi-page']/a[@class='aNxt']"),
             follow=True),
        # extract the detail-page links; note that restrict_xpaths should point
        # at the <a> elements themselves, not at the @href attribute
        Rule(LxmlLinkExtractor(restrict_xpaths="//div[@class='li-itemmod']/div/h3/a"),
             callback='parse_detail_page'),
    ]

    def parse_detail_page(self, response):
        community_name = response.xpath("//dl[@class='comm-l-detail float-l']/dd")[0].extract()
        self.log(community_name, 2)
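As an alternative to restrict_xpaths, the two URL levels described in the question could also be matched with allow= regular expressions. The patterns below are a sketch inferred from the example URLs in the question, not taken from the real site:

```python
import re

# Hypothetical patterns matching the URL shapes from the question:
#   list pages:   http://www.example.com/group/p{n}/
#   detail pages: http://www.example.com/group/view/{n}/
LIST_PAGE_RE = re.compile(r"/group/p\d+/?$")
DETAIL_PAGE_RE = re.compile(r"/group/view/\d+/?$")

print(bool(LIST_PAGE_RE.search("http://www.example.com/group/p2/")))         # True
print(bool(DETAIL_PAGE_RE.search("http://www.example.com/group/view/42/")))  # True
print(bool(DETAIL_PAGE_RE.search("http://www.example.com/group/p2/")))       # False
```

With this approach the rules would use Rule(LxmlLinkExtractor(allow=r"/group/p\d+/"), follow=True) for pagination and Rule(LxmlLinkExtractor(allow=r"/group/view/\d+/"), callback='parse_detail_page') for the detail pages, with no XPath restriction needed.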
BTW, there are too many brackets in the link extractor: restrict_xpaths=(["//div[@class='multi-page']/a[@class='aNxt']"]) - a plain string is enough.
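One more portability note: the question imports urljoin from urlparse, which exists only under Python 2. On Python 3 the same helper lives in urllib.parse and behaves identically:

```python
from urllib.parse import urljoin  # Python 2 equivalent: from urlparse import urljoin

# A relative or root-relative href extracted from a list page is
# resolved against response.url before being turned into a Request:
print(urljoin("http://www.example.com/group/p2/", "/group/view/123/"))
# http://www.example.com/group/view/123/
```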