scrapy - separate output file per start_url
Question
I have this Scrapy spider that runs well:
```python
# -*- coding: utf-8 -*-
import scrapy

class AllCategoriesSpider(scrapy.Spider):
    name = 'vieles'
    allowed_domains = ['examplewiki.de']
    start_urls = [
        'http://www.exampleregelwiki.de/index.php/categoryA.html',
        'http://www.exampleregelwiki.de/index.php/categoryB.html',
        'http://www.exampleregelwiki.de/index.php/categoryC.html',
    ]

    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to the subpages
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.last.block').extract(),
        }
```
With `scrapy runspider spider.py -o dat.json` it saves all info to `dat.json`.
I would like to have one output file per start URL: categoryA.json, categoryB.json, and so on.
A similar question has been left unanswered; I cannot reproduce this answer, and I was not able to learn from the suggestions there.
How do I achieve the goal of having several output files, one per start URL? I would like to run only one command/shell script/file to achieve this.
Answer
You didn't use real URLs in your code, so I used my own page for the test. I had to change the CSS selectors, and I used different fields.
I save the data as CSV because it is easier to append. JSON would require reading all items from the file, adding the new item, and saving all items to the same file again.
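That difference can be sketched like this (file names and items here are just placeholders, not from the original spider):

```python
import csv
import json
import os
import tempfile

tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "categoryA.csv")
json_path = os.path.join(tmpdir, "categoryA.json")

# CSV: each new item is a single appended line -- earlier rows are never touched.
for item in [["Title 1", "2017-01-01"], ["Title 2", "2017-01-02"]]:
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow(item)

# JSON: to keep the file a valid list, we must read everything back,
# append the new item in memory, and rewrite the whole file each time.
for item in [{"Title": "Title 1"}, {"Title": "Title 2"}]:
    data = []
    if os.path.exists(json_path):
        with open(json_path) as f:
            data = json.load(f)
    data.append(item)
    with open(json_path, "w") as f:
        json.dump(data, f)
```

For one item per request the CSV version does O(1) work, while the JSON version rereads and rewrites the whole file on every item.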
I create an extra field `Category` to use later as the filename in the pipeline.
items.py
```python
import scrapy

class CategoryItem(scrapy.Item):
    Title = scrapy.Field()
    Date = scrapy.Field()
    # extra field, used later as the filename
    Category = scrapy.Field()
```
In the spider I get the category from the URL and send it to `parse_details` using `meta` in the `Request`. In `parse_details` I add `category` to the item.
spider/example.py
```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['blog.furas.pl']
    start_urls = [
        'http://blog.furas.pl/category/python.html',
        'http://blog.furas.pl/category/html.html',
        'http://blog.furas.pl/category/linux.html',
    ]

    def parse(self, response):
        # get category from url
        category = response.url.split('/')[-1][:-5]

        urls = response.css('article a::attr(href)').extract()  # links to the subpages
        for url in urls:
            # skip some urls
            if ('/tag/' not in url) and ('/category/' not in url):
                url = response.urljoin(url)
                # add category (as meta) to send it to the callback function
                yield scrapy.Request(url=url, callback=self.parse_details,
                                     meta={'category': category})

    def parse_details(self, response):
        # get category
        category = response.meta['category']

        # get only the first title (or empty string '') and strip it
        title = response.css('h1.entry-title a::text').extract_first('')
        title = title.strip()

        # get only the first date (or empty string '') and strip it
        date = response.css('.published::text').extract_first('')
        date = date.strip()

        yield {
            'Title': title,
            'Date': date,
            'Category': category,
        }
```
In the pipeline I get `category` and use it to open the file for appending and save the item.
pipelines.py
```python
import csv

class CategoryPipeline(object):

    def process_item(self, item, spider):
        # get category and use it as the filename
        filename = item['Category'] + '.csv'

        # open file for appending
        with open(filename, 'a') as f:
            writer = csv.writer(f)

            # write only selected elements
            row = [item['Title'], item['Date']]
            writer.writerow(row)

            # write all data in a row
            # warning: item is a dictionary, so item.values() may not always
            # return the values in the same order
            #writer.writerow(item.values())

        return item
```
In the settings I have to uncomment `ITEM_PIPELINES` to activate the pipeline.
settings.py
```python
ITEM_PIPELINES = {
    'category.pipelines.CategoryPipeline': 300,
}
```
Full code on GitHub: python-examples/scrapy/save-categories-in-separated-files
BTW: I think you could also write to the files directly in `parse_details`.