Extract 3-level content from paginated pages with Scrapy


Problem description


I have a seed url (say DOMAIN/manufacturers.php) with no pagination that looks like this:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>

<body>
    <div class="st-text">
        <table cellspacing="6" width="600">
            <tr>
                <td>
                    <a href="manufacturer1-type-59.php"></a>
                </td>

                <td>
                    <a href="manufacturer1-type-59.php">Name 1</a>
                </td>

                <td>
                    <a href="manufacturer2-type-5.php"></a>
                </td>

                <td>
                    <a href="manufacturer2-type-5.php">Name 2</a>
                </td>
            </tr>

            <tr>
                <td>
                    <a href="manufacturer3-type-88.php"></a>
                </td>

                <td>
                    <a href="manufacturer3-type-88.php">Name 3</a>
                </td>

                <td>
                    <a href="manufacturer4-type-76.php"></a>
                </td>

                <td>
                    <a href="manufacturer4-type-76.php">Name 4</a>
                </td>
            </tr>

            <tr>
                <td>
                    <a href="manufacturer5-type-28.php"></a>
                </td>

                <td>
                    <a href="manufacturer5-type-28.php">Name 5</a>
                </td>

                <td>
                    <a href="manufacturer6-type-48.php"></a>
                </td>

                <td>
                    <a href="manufacturer6-type-48.php">Name 6</a>
                </td>
            </tr>
        </table>
    </div>
</body>
</html>

From there I would like to get all a['href']'s, for example: manufacturer1-type-59.php. Note that these links do NOT contain the DOMAIN prefix, so my guess is that I have to add it somehow, or maybe not?
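Prepending the domain can be done with the standard library's urljoin, a minimal sketch (the BASE value and the sample hrefs are just the ones from the question):

```python
from urllib.parse import urljoin

# Assumed base domain; the hrefs are relative links as they appear on the page.
BASE = 'http://www.gsmarena.com/'
links = ['manufacturer1-type-59.php', 'manufacturer2-type-5.php']

# urljoin resolves each relative href against the base URL.
absolute = [urljoin(BASE, href) for href in links]
```

Newer Scrapy versions also offer response.urljoin(href), which does the same resolution against the page's own URL.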

Optionally, I would like to keep the links both in memory (for the very next phase) and also save them to disk for future reference.

The content of each of these links, such as manufacturer1-type-59.php, looks like this:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>

<body>
    <div class="makers">
        <ul>
            <li>
                <a href="manufacturer1_model1_type1.php"></a>
            </li>

            <li>
                <a href="manufacturer1_model1_type2.php"></a>
            </li>

            <li>
                <a href="manufacturer1_model2_type3.php"></a>
            </li>
        </ul>
    </div>

    <div class="nav-band">
        <div class="nav-items">
            <div class="nav-pages">
                <span>Pages:</span><strong>1</strong>
                <a href="manufacturer1-type-STRING-59-INT-p2.php">2</a>
                <a href="manufacturer1-type-STRING-59-INT-p3.php">3</a>
                <a href="manufacturer1-type-STRING-59-INT-p2.php" title="Next page">»</a>
            </div>
        </div>
    </div>
</body>
</html>

Next, I would like to get all a['href'] 's, for example manufacturer_model1_type1.php. Again, note that these links do NOT contain the domain prefix. One additional difficulty here is that these pages support pagination. So, I would like to go into all these pages too. As expected, manufacturer-type-59.php redirects to manufacturer-type-STRING-59-INT-p2.php.

Optionally, I would also like to keep the links both in memory (for the very next phase) and also save them to disk for future reference.

The third and final step should be to retrieve the content of all pages of type manufacturer_model1_type1.php, extract the title, and save the result in a file in the following form: (url, title).
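That last saving step can be sketched with the standard library's csv module; the file name and the collected (url, title) pairs below are placeholders, not real crawl output:

```python
import csv

# Hypothetical collected results: (url, title) pairs from the third crawl level.
results = [
    ('http://www.gsmarena.com/manufacturer1_model1_type1.php', 'Model 1'),
    ('http://www.gsmarena.com/manufacturer1_model1_type2.php', 'Model 2'),
]

# Write one (url, title) row per crawled page.
with open('titles.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(results)
```

In practice, once the spider yields items, Scrapy's built-in feed exports can do this directly: scrapy crawl gsmarena -o titles.csv.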

EDIT

This is what I have done so far, but it doesn't seem to work...

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ArchiveItem(scrapy.Item):
    url = scrapy.Field()

class ArchiveSpider(CrawlSpider):
    name = 'gsmarena'
    allowed_domains = ['gsmarena.com']
    start_urls = ['http://www.gsmarena.com/makers.php3']
    rules = [
        Rule(LinkExtractor(allow=['\S+-phones-\d+\.php'])),
        Rule(LinkExtractor(allow=['\S+-phones-f-\d+-0-\S+\.php'])),
        Rule(LinkExtractor(allow=['\S+_\S+_\S+-\d+\.php']), 'parse_archive'),
    ]

    def parse_archive(self, response):
        torrent = ArchiveItem()
        torrent['url'] = response.url
        return torrent

Solution

I think you'd be better off using BaseSpider instead of CrawlSpider.

This code might help:

from scrapy import Spider
from scrapy.http import Request


class GsmArenaSpider(Spider):
    name = 'gsmarena'
    start_urls = ['http://www.gsmarena.com/makers.php3', ]
    allowed_domains = ['gsmarena.com']
    BASE_URL = 'http://www.gsmarena.com/'

    def parse(self, response):
        markers = response.xpath('//div[@id="mid-col"]/div/table/tr/td/a/@href').extract()
        for marker in markers:
            yield Request(url=self.BASE_URL + marker, callback=self.parse_marker)

    def parse_marker(self, response):
        # extracting phone urls
        phones = response.xpath('//div[@class="makers"]/ul/li/a/@href').extract()
        if not phones:
            return
        for phone in phones:
            yield Request(url=self.BASE_URL + phone, callback=self.parse_phone)

        # pagination
        next_page = response.xpath('//a[contains(@title, "Next page")]/@href').extract()
        if next_page:
            yield Request(url=self.BASE_URL + next_page[0], callback=self.parse_marker)

    def parse_phone(self, response):
        # extract whatever stuff you want and yield items here
        pass
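For the final step described in the question (extract the title, keep it with the URL), parse_phone could look like the sketch below; the item keys are illustrative, not part of the original answer. It is written here as a plain function over the response, but on the spider it would be the parse_phone method:

```python
# Possible body for parse_phone: extract the page title and yield it
# together with the page URL as a simple item.
def parse_phone(response):
    titles = response.xpath('//title/text()').extract()
    yield {
        'url': response.url,
        'title': titles[0] if titles else '',
    }
```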

EDIT

If you want to keep track of where these phone URLs are coming from, you can pass the URL as meta from parse through parse_marker down to parse_phone. The requests would then look like:

 yield Request(url=self.BASE_URL + marker, callback=self.parse_marker, meta={'url_level1': response.url})

yield Request(url=self.BASE_URL + phone, callback=self.parse_phone, meta={'url_level2': response.url, 'url_level1': response.meta['url_level1']})
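On the receiving end, those values come back out of response.meta. A minimal sketch (key names follow the requests above; written as a plain function over the response, it would be the parse_phone method on the spider):

```python
# Receiving end of the meta chain: yield the level-1 and level-2 URLs
# that led to this page alongside the page's own URL.
def parse_phone(response):
    yield {
        'url_level1': response.meta.get('url_level1'),
        'url_level2': response.meta.get('url_level2'),
        'url_level3': response.url,
    }
```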
