Writing a crawler to parse a site in scrapy using BaseSpider

Views: 39 Published: 2021/7/16 22:20:16 Tags: python, scrapy


Problem description

I am confused about how to design the architecture of the crawler.

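I have a search page with:

Pagination: next-page links to follow
A list of products on each page
Individual product links to crawl to get the description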

I have the following code:

def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id=\'result-set\']/li')
    items = []
    for site in sites[:2]:

        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
        if item['product_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)

        request.meta['item'] = item
        return request

    soup = BeautifulSoup(response.body)
    mylinks= soup.find_all("a", text="Next")
    nextlink = mylinks[0].get('href')
    yield Request(urljoin(response.url, nextlink), callback=self.parse_page)

The problem is that the function has two exit points: a return for the item request and a yield for the next-page request.

With CrawlSpider I don't need the final yield, so everything worked fine, but with BaseSpider I have to follow links manually.

What should I do?
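
One possible way around the return/yield conflict, offered as a minimal sketch rather than a definitive answer: because the function contains a yield, it is a generator, so each item request can be yielded inside the loop and the next-page request yielded at the end. To keep the sketch runnable without Scrapy installed, `Request` below is a stand-in dataclass (not Scrapy's class), and `page` is a hypothetical pre-parsed dict standing in for the selector and BeautifulSoup work in the original code. `ProductSpider` and `parse_item_description` are illustrative names as well.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, Optional
from urllib.parse import urljoin


@dataclass
class Request:
    """Stand-in for Scrapy's Request (assumption: only url, callback, meta matter here)."""
    url: str
    callback: Optional[Callable] = None
    meta: dict = field(default_factory=dict)


class ProductSpider:
    base_url = "http://www.example.com"

    def parse_page(self, page: dict) -> Iterator[Request]:
        # `page` is a hypothetical dict:
        # {"url": str, "products": [{"product": ..., "product_link": ...}], "next": str or None}
        for product in page["products"]:
            item = dict(product)
            if item.get("product_link"):
                # yield instead of return, so the loop keeps going
                # and the pagination code below is still reached
                yield Request(
                    urljoin(self.base_url, item["product_link"]),
                    callback=self.parse_item_description,
                    meta={"item": item},
                )

        # After all product requests, follow the "Next" link if there is one
        if page.get("next"):
            yield Request(urljoin(page["url"], page["next"]), callback=self.parse_page)

    def parse_item_description(self, response):
        # The item attached to the request travels along in response.meta
        item = response.meta["item"]
        item["description"] = response.text  # real code would use a selector here
        return item
```

In a real spider the same shape would apply with Scrapy's own `Request` and a `response` argument: the callback yields one request per product, carrying the partially filled item in `meta`, and then yields the request for the next page.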
