Extract 3-level content from paginated pages with Scrapy


Problem description


I have a seed url (say DOMAIN/manufacturers.php) with no pagination that looks like this:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>

<body>
    <div class="st-text">
        <table cellspacing="6" width="600">
            <tr>
                <td>
                    <a href="manufacturer1-type-59.php"></a>
                </td>

                <td>
                    <a href="manufacturer1-type-59.php">Name 1</a>
                </td>

                <td>
                    <a href="manufacturer2-type-5.php"></a>
                </td>

                <td>
                    <a href="manufacturer2-type-5.php">Name 2</a>
                </td>
            </tr>

            <tr>
                <td>
                    <a href="manufacturer3-type-88.php"></a>
                </td>

                <td>
                    <a href="manufacturer3-type-88.php">Name 3</a>
                </td>

                <td>
                    <a href="manufacturer4-type-76.php"></a>
                </td>

                <td>
                    <a href="manufacturer4-type-76.php">Name 4</a>
                </td>
            </tr>

            <tr>
                <td>
                    <a href="manufacturer5-type-28.php"></a>
                </td>

                <td>
                    <a href="manufacturer5-type-28.php">Name 5</a>
                </td>

                <td>
                    <a href="manufacturer6-type-48.php"></a>
                </td>

                <td>
                    <a href="manufacturer6-type-48.php">Name 6</a>
                </td>
            </tr>
        </table>
    </div>
</body>
</html>

From there I would like to get all a['href']'s, for example: manufacturer1-type-59.php. Note that these links do NOT contain the DOMAIN prefix, so my guess is that I have to add it somehow, or maybe not?
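Prepending the domain can be done with the standard library's urljoin, a minimal sketch (the BASE value and the sample hrefs are just the ones from the question):

```python
from urllib.parse import urljoin

# Assumed base domain; the hrefs are relative links as they appear on the page.
BASE = 'http://www.gsmarena.com/'
links = ['manufacturer1-type-59.php', 'manufacturer2-type-5.php']

# urljoin resolves each relative href against the base URL.
absolute = [urljoin(BASE, href) for href in links]
```

Newer Scrapy versions also offer response.urljoin(href), which does the same resolution against the page's own URL.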

Optionally, I would like to keep the links both in memory (for the very next phase) and also save them to disk for future reference.

The content of each of these links, such as manufacturer1-type-59.php, looks like this:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>

<body>
    <div class="makers">
        <ul>
            <li>
                <a href="manufacturer1_model1_type1.php"></a>
            </li>

            <li>
                <a href="manufacturer1_model1_type2.php"></a>
            </li>

            <li>
                <a href="manufacturer1_model2_type3.php"></a>
            </li>
        </ul>
    </div>

    <div class="nav-band">
        <div class="nav-items">
            <div class="nav-pages">
                <span>Pages:</span><strong>1</strong>
                <a href="manufacturer1-type-STRING-59-INT-p2.php">2</a>
                <a href="manufacturer1-type-STRING-59-INT-p3.php">3</a>
                <a href="manufacturer1-type-STRING-59-INT-p2.php" title="Next page">»</a>
            </div>
        </div>
    </div>
</body>
</html>

Next, I would like to get all a['href'] 's, for example manufacturer_model1_type1.php. Again, note that these links do NOT contain the domain prefix. One additional difficulty here is that these pages support pagination. So, I would like to go into all these pages too. As expected, manufacturer-type-59.php redirects to manufacturer-type-STRING-59-INT-p2.php.

Optionally, I would also like to keep the links both in memory (for the very next phase) and also save them to disk for future reference.

The third and final step should be to retrieve the content of all pages of type manufacturer_model1_type1.php, extract the title, and save the result in a file in the following form: (url, title).
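That last saving step can be sketched with the standard library's csv module; the file name and the collected (url, title) pairs below are placeholders, not real crawl output:

```python
import csv

# Hypothetical collected results: (url, title) pairs from the third crawl level.
results = [
    ('http://www.gsmarena.com/manufacturer1_model1_type1.php', 'Model 1'),
    ('http://www.gsmarena.com/manufacturer1_model1_type2.php', 'Model 2'),
]

# Write one (url, title) row per crawled page.
with open('titles.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(results)
```

In practice, once the spider yields items, Scrapy's built-in feed exports can do this directly: scrapy crawl gsmarena -o titles.csv.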

EDIT

This is what I have done so far, but it doesn't seem to work...

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ArchiveItem(scrapy.Item):
    url = scrapy.Field()

class ArchiveSpider(CrawlSpider):
    name = 'gsmarena'
    allowed_domains = ['gsmarena.com']
    start_urls = ['http://www.gsmarena.com/makers.php3']
    rules = [
        Rule(LinkExtractor(allow=['\S+-phones-\d+\.php'])),
        Rule(LinkExtractor(allow=['\S+-phones-f-\d+-0-\S+\.php'])),
        Rule(LinkExtractor(allow=['\S+_\S+_\S+-\d+\.php']), 'parse_archive'),
    ]

    def parse_archive(self, response):
        torrent = ArchiveItem()
        torrent['url'] = response.url
        return torrent

Solution

I think you'd be better off using BaseSpider instead of CrawlSpider.

This code might help:

from scrapy import Spider
from scrapy.http import Request


class GsmArenaSpider(Spider):
    name = 'gsmarena'
    start_urls = ['http://www.gsmarena.com/makers.php3', ]
    allowed_domains = ['gsmarena.com']
    BASE_URL = 'http://www.gsmarena.com/'

    def parse(self, response):
        markers = response.xpath('//div[@id="mid-col"]/div/table/tr/td/a/@href').extract()
        for marker in markers:
            yield Request(url=self.BASE_URL + marker, callback=self.parse_marker)

    def parse_marker(self, response):
        # extracting phone urls
        phones = response.xpath('//div[@class="makers"]/ul/li/a/@href').extract()
        if not phones:
            return
        for phone in phones:
            yield Request(url=self.BASE_URL + phone, callback=self.parse_phone)

        # pagination
        next_page = response.xpath('//a[contains(@title, "Next page")]/@href').extract()
        if next_page:
            yield Request(url=self.BASE_URL + next_page[0], callback=self.parse_marker)

    def parse_phone(self, response):
        # extract whatever stuff you want and yield items here
        pass
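For the final step described in the question (extract the title, keep it with the URL), parse_phone could look like the sketch below; the item keys are illustrative, not part of the original answer. It is written here as a plain function over the response, but on the spider it would be the parse_phone method:

```python
# Possible body for parse_phone: extract the page title and yield it
# together with the page URL as a simple item.
def parse_phone(response):
    titles = response.xpath('//title/text()').extract()
    yield {
        'url': response.url,
        'title': titles[0] if titles else '',
    }
```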

EDIT

If you want to keep track of where these phone URLs are coming from, you can pass the URL as meta from parse through parse_marker down to parse_phone. The requests would then look like:

 yield Request(url=self.BASE_URL + marker, callback=self.parse_marker, meta={'url_level1': response.url})

yield Request(url=self.BASE_URL + phone, callback=self.parse_phone, meta={'url_level2': response.url, 'url_level1': response.meta['url_level1']})
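On the receiving end, those values come back out of response.meta. A minimal sketch (key names follow the requests above; written as a plain function over the response, it would be the parse_phone method on the spider):

```python
# Receiving end of the meta chain: yield the level-1 and level-2 URLs
# that led to this page alongside the page's own URL.
def parse_phone(response):
    yield {
        'url_level1': response.meta.get('url_level1'),
        'url_level2': response.meta.get('url_level2'),
        'url_level3': response.url,
    }
```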
