Scrapy Python CSV output has blank lines between each row
Problem description
I am getting unwanted blank lines between each row of Scrapy output in the resulting CSV file.
I have moved from Python 2 to Python 3, and I use Windows 10. I am therefore in the process of adapting my Scrapy projects for Python 3.
My current (and for now, sole) problem is that when I write the Scrapy output to a CSV file, I get a blank line between each row. This has been raised in several posts here (it has to do with Windows), but I have been unable to get any of the suggested solutions to work.
As it happens, I have also added some code to the pipelines.py file to ensure the CSV output is in a given column order rather than a random one. Hence, I can run the spider with the normal scrapy crawl charleschurch rather than scrapy crawl charleschurch -o charleschurch2017xxxx.csv
Does anyone know how to skip or omit this blank line in the CSV output?
My pipelines.py code is below (I perhaps don't need the import csv line, but I suspect I may need it for the final answer):
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv

from scrapy import signals
from scrapy.exporters import CsvItemExporter


class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ["plotid","plotprice","plotname","name","address"]
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
I added this line to the settings.py file (I am not sure of the relevance of the 300):
ITEM_PIPELINES = {'CharlesChurch.pipelines.CSVPipeline': 300 }
My Scrapy spider code is below:
import scrapy
from urllib.parse import urljoin

from CharlesChurch.items import CharleschurchItem


class charleschurchSpider(scrapy.Spider):
    name = "charleschurch"
    allowed_domains = ["charleschurch.com"]
    start_urls = ["https://www.charleschurch.com/county-durham_willington/the-ridings-1111"]

    def parse(self, response):
        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
            item = CharleschurchItem()
            item['name'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/span[1]/b/text()').extract()
            item['address'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/div/*[@itemprop="postalCode"]/text()').extract()
            plotnames = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/text()').extract()
            plotnames = [plotname.strip() for plotname in plotnames]
            plotids = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/@href').extract()
            plotids = [plotid.strip() for plotid in plotids]
            plotprices = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__price"]/text()').extract()
            plotprices = [plotprice.strip() for plotprice in plotprices]
            result = zip(plotnames, plotids, plotprices)
            for plotname, plotid, plotprice in result:
                item['plotname'] = plotname
                item['plotid'] = plotid
                item['plotprice'] = plotprice
                yield item
Recommended answer
I suspect this is not ideal, but I have found a workaround. In the pipelines.py file I have added code that reads the CSV file containing the blank lines into a list, drops the empty rows, and then writes the cleaned list to a new file.
The code I added is:
with open('%s_items.csv' % spider.name, 'r') as f:
    reader = csv.reader(f)
    original_list = list(reader)
    cleaned_list = list(filter(None,original_list))

with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
    wr = csv.writer(output_file, dialect='excel')
    for data in cleaned_list:
        wr.writerow(data)
So the entire pipelines.py file is:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv

from scrapy import signals
from scrapy.exporters import CsvItemExporter


class CSVPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ["plotid","plotprice","plotname","name","address"]
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

        # Given I am using Windows, I need to eliminate the blank lines in the csv file
        print("Starting csv blank line cleaning")
        with open('%s_items.csv' % spider.name, 'r') as f:
            reader = csv.reader(f)
            original_list = list(reader)
            cleaned_list = list(filter(None,original_list))

        with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
            wr = csv.writer(output_file, dialect='excel')
            for data in cleaned_list:
                wr.writerow(data)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


class CharleschurchPipeline(object):
    def process_item(self, item, spider):
        return item
Not ideal, but it solves the problem for now.
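For context, here is a minimal standard-library sketch of where the doubled line ending probably comes from. It assumes the cause is newline translation: csv.writer ends each row with '\r\n', and a text stream that translates '\n' to '\r\n' (the default on Windows) turns that into '\r\r\n', which shows up as a blank row. The write_rows helper is hypothetical, and newline='\r\n' is used only to imitate the Windows default on any platform. Whether a particular Scrapy version's CsvItemExporter wraps the file this way is an assumption here, not something I have verified.

```python
import csv
import io


def write_rows(newline):
    """Write two CSV rows through a text wrapper and return the raw bytes.

    Hypothetical helper for illustration only.
    """
    raw = io.BytesIO()
    # newline='\r\n' imitates the '\n' -> '\r\n' translation that Windows
    # applies by default; newline='' switches translation off.
    stream = io.TextIOWrapper(raw, encoding="utf-8", newline=newline,
                              write_through=True)
    writer = csv.writer(stream)  # default dialect terminates rows with '\r\n'
    writer.writerow(["plotid", "plotprice"])
    writer.writerow(["1", "250000"])
    stream.flush()
    return raw.getvalue()


translated = write_rows("\r\n")   # each row ends with '\r\r\n'
untranslated = write_rows("")     # each row ends with '\r\n'
print(translated)
print(untranslated)
```

If that is indeed the cause, writing through a stream opened with newline='' (as the cleaned file above already is) would avoid the extra '\r' at the source instead of removing the blank rows afterwards.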