Writing a crawler to parse a site in scrapy using BaseSpider
Question
I am getting confused about how to design the architecture of the crawler.
The site I am crawling has:
- pagination: next-page links to follow
- a list of products on each page
- individual links that have to be crawled to get the description
I have the following code:
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//ol[@id='result-set']/li")
    items = []
    for site in sites[:2]:
        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
        if item['profile_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            return request

    soup = BeautifulSoup(response.body)
    mylinks = soup.find_all("a", text="Next")
    nextlink = mylinks[0].get('href')
    yield Request(urljoin(response.url, nextlink), callback=self.parse_page)
The problem is that I have two return statements: one for the request, and one for the yield.
In the crawl spider, I don't need to use the last yield, so everything was working fine, but in BaseSpider I have to follow links manually.
What should I do?
Answer
As an initial pass (and based on your comment about wanting to do this yourself), I would suggest taking a look at the CrawlSpider code to get an idea of how to implement its functionality.
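For reference, a minimal sketch of what that functionality boils down to when done by hand in a BaseSpider: make the callback a generator and yield every request — the per-product requests and the next-page request — instead of mixing return and yield. The spider name, start URL, XPaths, the parse_item details, and the myproject.items import path below are placeholders for illustration, not taken from the question.

    # Sketch only: names, XPaths and module paths are assumptions, not the asker's real code.
    from urlparse import urljoin

    from scrapy.spider import BaseSpider
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector

    from myproject.items import MyProduct  # your item class; module path is a placeholder


    class MySpider(BaseSpider):
        name = "example"
        start_urls = ["http://www.example.com/products"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)

            # One request per product on the listing page.
            for site in hxs.select("//ol[@id='result-set']/li"):
                item = MyProduct()
                item['product'] = site.select('h2/a/text()').extract()
                link = site.select('h2/a/@href').extract()
                if link:
                    request = Request(urljoin(response.url, link[0]),
                                      callback=self.parse_item)
                    request.meta['item'] = item
                    yield request  # yield (not return) so the loop keeps going

            # One more request, from the same callback, for the pagination link.
            next_page = hxs.select("//a[text()='Next']/@href").extract()
            if next_page:
                yield Request(urljoin(response.url, next_page[0]),
                              callback=self.parse)

        def parse_item(self, response):
            # Fill in the description on the detail page, then yield the item.
            item = response.meta['item']
            item['description'] = HtmlXPathSelector(response)\
                .select('string(//div[@id="description"])').extract()
            yield item

CrawlSpider does essentially the same thing internally: for every response it runs its Rules' link extractors and yields a Request for each matched link, which is why you never had to write the pagination request yourself there.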