Crawl and scrape a complete site with scrapy


Problem description


import scrapy
from scrapy import Request

#scrapy crawl jobs9 -o jobs9.csv -t csv
class JobsSpider(scrapy.Spider):
    name = "jobs9"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com/7-principiantes-kit-s-de-inicio-",
                  "https://www.vapedonia.com/10-cigarrillos-electronicos-",
                  "https://www.vapedonia.com/11-mods-potencia-",
                  "https://www.vapedonia.com/12-consumibles",
                  "https://www.vapedonia.com/13-baterias",
                  "https://www.vapedonia.com/23-e-liquidos",
                  "https://www.vapedonia.com/26-accesorios",
                  "https://www.vapedonia.com/31-atomizadores-reparables",
                  "https://www.vapedonia.com/175-alquimia-",
                  "https://www.vapedonia.com/284-articulos-en-liquidacion"]

    def parse(self, response):
        products = response.xpath('//div[@class="product-container clearfix"]')
        for product in products:
            image = product.xpath('div[@class="center_block"]/a/img/@src').extract_first()
            link = product.xpath('div[@class="center_block"]/a/@href').extract_first()
            name = product.xpath('div[@class="right_block"]/p/a/text()').extract_first()
            price = product.xpath('div[@class="right_block"]/div[@class="content_price"]/span[@class="price"]/text()').extract_first().encode("utf-8")
            yield {'Image': image, 'Link': link, 'Name': name, 'Price': price}

        relative_next_url = response.xpath('//*[@id="pagination_next"]/a/@href').extract_first()
        absolute_next_url = "https://www.vapedonia.com" + str(relative_next_url)
        yield Request(absolute_next_url, callback=self.parse)

With that code, I correctly scrape the products of a page and its subpages; all pages are crawled.

If I want to scrape the whole site, I must put the category URLs manually in "start_urls". The good approach would be to crawl those URLs so the crawl becomes dynamic.

How can I mix crawling with scraping beyond the simple paginated crawl?

Thank you.

Now I have improved my code; here is the new version:

import scrapy
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

#scrapy crawl jobs10 -o jobs10.csv -t csv
class JobsSpider(scrapy.spiders.CrawlSpider):
    name = "jobs10"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com/"]

    rules = (Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)), callback='parse_category'), )

    def parse_category(self, response):
        products = response.xpath('//div[@class="product-container clearfix"]')
        for product in products:
            image = product.xpath('div[@class="center_block"]/a/img/@src').extract_first()
            link = product.xpath('div[@class="center_block"]/a/@href').extract_first()
            name = product.xpath('div[@class="right_block"]/p/a/text()').extract_first()
            price = product.xpath('div[@class="right_block"]/div[@class="content_price"]/span[@class="price"]/text()').extract_first().encode("utf-8")
            yield {'Image': image, 'Link': link, 'Name': name, 'Price': price}

The changes I've made are the following:

1- I import CrawlSpider, Rule and LinkExtractor

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

2- the JobsSpider class does not inherit from "scrapy.Spider" anymore. It now inherits from scrapy.spiders.CrawlSpider (which was imported in the previous step)
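
In other words, only the base class of the declaration changes; the crawl is then driven by the class-level rules instead of a hand-written parse() loop:

import scrapy

# before: a plain Spider with its own parse() and manual pagination requests
class JobsSpider(scrapy.Spider):
    ...

# after: a CrawlSpider; link following is handled by the rules defined in step 4
class JobsSpider(scrapy.spiders.CrawlSpider):
    ...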

3- "start_urls" is no longer composed of a static list of URLs; we just take the domain root, so

start_urls = ["https://www.vapedonia.com/7-principiantes-kit-s-de-inicio-", 
    "https://www.vapedonia.com/10-cigarrillos-electronicos-", 
    "https://www.vapedonia.com/11-mods-potencia-", 
    "https://www.vapedonia.com/12-consumibles", 
    "https://www.vapedonia.com/13-baterias", 
    "https://www.vapedonia.com/23-e-liquidos", 
    "https://www.vapedonia.com/26-accesorios", 
    "https://www.vapedonia.com/31-atomizadores-reparables", 
    "https://www.vapedonia.com/175-alquimia-", 
    "https://www.vapedonia.com/284-articulos-en-liquidacion"]

is replaced by

start_urls = ["https://www.vapedonia.com/"]

4- we define the rules

rules = (Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)), callback='parse_category'), )

the callback is no longer "parse" but "parse_category" (with a CrawlSpider the callback must not be named "parse", since CrawlSpider uses that method internally to implement its logic)

5- the previous pagination crawling disappears, so the following code is removed:

relative_next_url = response.xpath('//*[@id="pagination_next"]/a/@href').extract_first()
absolute_next_url = "https://www.vapedonia.com" + str(relative_next_url)
yield Request(absolute_next_url, callback=self.parse)

As I see it, and it seems logical, the pagination crawling process is replaced by the URL crawling process.

But... it does not work, and even the "price" field, which worked with encode("utf-8"), does not work anymore.
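
One plausible cause: extract_first() returns None whenever the price XPath matches nothing for a product, and calling .encode("utf-8") on None raises an AttributeError. A minimal guarded sketch of that line, reusing the selector from the spider above (falling back to None for a missing price is an assumption):

price_node = product.xpath('div[@class="right_block"]/div[@class="content_price"]/span[@class="price"]/text()').extract_first()
# Only encode when a price was actually found (assumption: keep None otherwise)
price = price_node.encode("utf-8") if price_node is not None else None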

Solution

You need to use a CrawlSpider with rules in this case. Below is a simple translation of your scraper:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class JobsSpider(scrapy.spiders.CrawlSpider):
    name = "jobs9"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com"]

    rules = (Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)), callback='parse_category'), )

    def parse_category(self, response):
        products = response.xpath('//div[@class="product-container clearfix"]')
        for product in products:
            image = product.xpath('div[@class="center_block"]/a/img/@src').extract_first()
            link = product.xpath('div[@class="center_block"]/a/@href').extract_first()
            name = product.xpath('div[@class="right_block"]/p/a/text()').extract_first()
            price = product.xpath(
                'div[@class="right_block"]/div[@class="content_price"]/span[@class="price"]/text()').extract_first().encode(
                "utf-8")
            yield {'Image': image, 'Link': link, 'Name': name, 'Price': price}
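
Note that a Rule with a callback does not follow links from the pages it parses by default (follow defaults to False when a callback is given), so paginated category pages can be missed. A hedged variant of the rules that also follows links found on matched pages:

rules = (
    # follow=True keeps extracting links from matched pages as well,
    # so paginated category pages that match the same pattern are also crawled
    Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)),
         callback='parse_category', follow=True),
)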

Look at different spiders on https://doc.scrapy.org/en/latest/topics/spiders.html
