scrapy - separate output file per start url


Question

I have this scrapy spider that runs well:

# -*- coding: utf-8 -*-
import scrapy


class AllCategoriesSpider(scrapy.Spider):
    name = 'vieles'
    allowed_domains = ['examplewiki.de']
    start_urls = ['http://www.exampleregelwiki.de/index.php/categoryA.html','http://www.exampleregelwiki.de/index.php/categoryB.html','http://www.exampleregelwiki.de/index.php/categoryC.html',]

    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to the subpages
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.last.block').extract(),
        }

Running scrapy runspider spider.py -o dat.json saves all the info to dat.json.

I would like to have one output file per start url: categoryA.json, categoryB.json and so on.

A similar question has been left unanswered; I cannot reproduce the answer there, and I was not able to learn from the suggestions.

How do I achieve the goal of having several output files, one per start url? I would like to run only one command/shellscript/file to achieve this.

Answer

You didn't use real urls in your code, so I used my own page for the test. I had to change the css selectors and I used different fields.

I save the data as csv because it is easier to append to. With JSON you would need to read all items from the file, add the new item and save all the items again in the same file.

I create an extra field Category to use it later as the filename in the pipeline.

items.py

import scrapy

class CategoryItem(scrapy.Item):
    Title = scrapy.Field()
    Date = scrapy.Field()
    # extra field use later as filename 
    Category = scrapy.Field()

In the spider I get the category from the url and send it to parse_details using meta in the Request. In parse_details I add the category to the item.

spider/example.py

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html','http://blog.furas.pl/category/html.html','http://blog.furas.pl/category/linux.html']

    def parse(self, response):

        # get category from url
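        # e.g. 'http://blog.furas.pl/category/python.html' -> 'python'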
        category = response.url.split('/')[-1][:-5]

        urls = response.css('article a::attr(href)').extract()  # links to the subpages

        for url in urls:
            # skip some urls
            if ('/tag/' not in url) and ('/category/' not in url):
                url = response.urljoin(url)
                # add category (as meta) to send it to callback function
                yield scrapy.Request(url=url, callback=self.parse_details, meta={'category': category})

    def parse_details(self, response):

        # get category
        category = response.meta['category']

        # get only first title (or empty string '') and strip it
        title = response.css('h1.entry-title a::text').extract_first('')
        title = title.strip()

        # get only first date (or empty string '') and strip it
        date = response.css('.published::text').extract_first('')
        date = date.strip()

        yield {
            'Title': title,
            'Date': date,
            'Category': category,
        }

In the pipeline I get the category and use it to open a file for appending and save the item.

pipelines.py

import csv

class CategoryPipeline(object):

    def process_item(self, item, spider):

        # get category and use it as filename
        filename = item['Category'] + '.csv'

        # open file for appending
        with open(filename, 'a') as f:
            writer = csv.writer(f)

            # write only selected elements 
            row = [item['Title'], item['Date']]
            writer.writerow(row)

            # write all data in a row
            # warning: item is a dictionary, so item.values() may not always return the values in the same order
            #writer.writerow(item.values())

        return item
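
As a side note to the csv-vs-JSON point above: if you really want JSON output you can avoid the read-all/save-all problem by appending JSON Lines (one JSON object per line). This is only a sketch of an alternative pipeline, not part of the original answer; CategoryJsonLinesPipeline and the .jl extension are made-up names, and it assumes the same Category field as above.

import json

class CategoryJsonLinesPipeline(object):

    def process_item(self, item, spider):
        # sketch only: one .jl (JSON Lines) file per category,
        # appending one JSON object per line instead of csv rows
        filename = item['Category'] + '.jl'

        with open(filename, 'a') as f:
            f.write(json.dumps(dict(item)) + '\n')

        return item

Each category then ends up in its own file, e.g. python.jl, html.jl, linux.jl.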

In the settings I have to uncomment ITEM_PIPELINES to activate the pipeline.

settings.py

ITEM_PIPELINES = {
    'category.pipelines.CategoryPipeline': 300,
}
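
With the pipeline enabled, a single run from inside the project (for example scrapy crawl example, assuming the project module is called category as in the pipeline path above) should produce one csv per category: python.csv, html.csv and linux.csv.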


Full code on GitHub: python-examples/scrapy/save-categories-in-separated-files

BTW: I think you could also write to the files directly in parse_details, as sketched below.
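
For completeness, a minimal sketch of that idea, assuming the same selectors and meta-based category as in the spider above (writing from the callback bypasses the item pipeline, so the pipeline version above is usually the cleaner option):

import csv

import scrapy


class ExampleSpider(scrapy.Spider):
    # ... same name, allowed_domains, start_urls and parse() as above ...

    def parse_details(self, response):
        category = response.meta['category']

        title = response.css('h1.entry-title a::text').extract_first('').strip()
        date = response.css('.published::text').extract_first('').strip()

        # append the row straight from the callback, no pipeline involved
        with open(category + '.csv', 'a') as f:
            csv.writer(f).writerow([title, date])

        # still yield the item so normal exporting/logging keeps working
        yield {'Title': title, 'Date': date, 'Category': category}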
