How to tidy up csv output from scrapy when using files item pipeline
Question
After a lot of help from the SO community I have a scrapy crawler which saves the webpage of the site it crawls, but I'd like to tidy up the csv file that gets created by --output
A sample row currently looks like
"[{'url': 'http://example.com/page', 'path': 'full/hashedfile', 'checksum': 'checksumvalue'}]",http://example.com/page,2016-06-20 16:10:24.824000,http://example.com/page,My Example Page
How do I get the csv file to contain details for one file per line (without the extra url: prefix), and how do I make the path value include an extension such as .html or .txt?
My items.py is as follows:
class MycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    crawldate = scrapy.Field()
    pageurl = scrapy.Field()
    files = scrapy.Field()
    file_urls = scrapy.Field()
My rules callback is
def scrape_page(self, response):
    page_soup = BeautifulSoup(response.body, "html.parser")
    ScrapedPageTitle = page_soup.title.get_text()
    item = MycrawlerItem()
    item['title'] = ScrapedPageTitle
    item['crawldate'] = datetime.datetime.now()
    item['pageurl'] = response.url
    item['file_urls'] = [response.url]
    yield item
In the crawler log it shows
2016-06-20 16:10:26 [scrapy] DEBUG: Scraped from <200 http://example.com/page>
{'crawldate': datetime.datetime(2016, 6, 20, 16, 10, 24, 824000),
'file_urls': ['http://example.com/page'],
'files': [{'checksum': 'checksumvalue',
'path': 'full/hashedfile',
'url': 'http://example.com/page'}],
'pageurl': 'http://example.com/page',
'title': u'My Example Page'}
The ideal structure for each csv line would be
crawldate,file_url,file_path,title
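For reference, a row in that target shape can be produced by flattening the nested files list with the plain csv module. This is a sketch against a hand-built dict mimicking the logged item above, not the actual spider output:

```python
import csv
import datetime
import io

# A hypothetical item dict shaped like the crawler log output above.
item = {
    'crawldate': datetime.datetime(2016, 6, 20, 16, 10, 24, 824000),
    'files': [{'checksum': 'checksumvalue',
               'path': 'full/hashedfile',
               'url': 'http://example.com/page'}],
    'title': u'My Example Page',
}

buf = io.StringIO()
writer = csv.writer(buf)
# One row per entry in 'files', pulling url and path out of the nested dict.
for f in item['files']:
    writer.writerow([item['crawldate'], f['url'], f['path'], item['title']])

print(buf.getvalue())
```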
If you want custom formats and such you probably want to just use good ol' scrapy item pipelines.
In the pipeline methods process_item or close_spider you can write your item to a file. Like:
def process_item(self, item, spider):
    if not getattr(spider, 'csv', False):
        return item
    with open('{}.csv'.format(spider.name), 'a') as f:
        writer = csv.writer(f)
        writer.writerow([item['crawldate'], item['title']])
    return item
This will write out a <spider_name>.csv file every time you run the spider with the csv flag, i.e. scrapy crawl twitter -a csv=True
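For the pipeline to run at all, it also needs to be registered in settings.py. The module path and class name below are placeholders, assuming the pipeline class lives in your project's pipelines module:

```python
# settings.py -- enable the custom pipeline; 'mycrawler.pipelines.CsvWriterPipeline'
# is a placeholder path for wherever you define process_item.
ITEM_PIPELINES = {
    'mycrawler.pipelines.CsvWriterPipeline': 300,
}
```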
You can make this more efficient if you open a file in the open_spider method and close it in close_spider, but it's the same thing otherwise.
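That more efficient variant could be sketched as a full pipeline class that opens the file once per crawl instead of once per item. The class name CsvWriterPipeline is made up for illustration; open_spider and close_spider are Scrapy's standard pipeline hooks:

```python
import csv

class CsvWriterPipeline(object):
    """Sketch: open the csv file once when the spider starts,
    append one row per item, close the file when the spider finishes."""

    def open_spider(self, spider):
        self.file = None
        # Only write csv when the spider was started with -a csv=True.
        if getattr(spider, 'csv', False):
            self.file = open('{}.csv'.format(spider.name), 'a')
            self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        if self.file is not None:
            self.writer.writerow([item['crawldate'], item['title']])
        return item

    def close_spider(self, spider):
        if self.file is not None:
            self.file.close()
```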