Scrapy pipeline to export csv file in the right format
Question
I made the improvement according to the suggestion from alexce below. What I need is like the picture below. However, each row/line should be one review: with date, rating, review text and link.

I need to let the item processor process each review of every page.

Currently TakeFirst() only takes the first review of the page. So with 10 pages, I only have 10 lines/rows, as in the picture below.

Spider code is below:

```python
import scrapy
from amazon.items import AmazonItem

class AmazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ['amazon.co.uk']
    start_urls = [
        'http://www.amazon.co.uk/product-reviews/B0042EU3A2/'.format(page)
        for page in xrange(1, 114)
    ]

    def parse(self, response):
        for sel in response.xpath('//*[@id="productReviews"]//tr/td[1]'):
            item = AmazonItem()
            item['rating'] = sel.xpath('div/div[2]/span[1]/span/@title').extract()
            item['date'] = sel.xpath('div/div[2]/span[2]/nobr/text()').extract()
            item['review'] = sel.xpath('div/div[6]/text()').extract()
            item['link'] = sel.xpath('div/div[7]/div[2]/div/div[1]/span[3]/a/@href').extract()
            yield item
```
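The TakeFirst() behaviour described above can be sketched without Scrapy: an output processor like TakeFirst keeps only the first non-empty value collected for a field, so loading a whole page into a single item collapses all of that page's reviews into one row. The `take_first` helper and the sample data below are hypothetical illustrations, not Scrapy's actual classes:

```python
# Sketch of what an output processor like Scrapy's TakeFirst does:
# return the first non-empty value from a list of extracted values.
def take_first(values):
    for v in values:
        if v is not None and v != '':
            return v
    return None

# Hypothetical extraction result for one page holding three reviews.
ratings_on_page = ['5.0 out of 5 stars', '3.0 out of 5 stars', '1.0 out of 5 stars']

# Loading the whole page into one item keeps only the first review...
single_item_rating = take_first(ratings_on_page)

# ...whereas yielding one item per review (as in the for-loop above)
# produces one row per review.
per_review_items = [{'rating': r} for r in ratings_on_page]

print(single_item_rating)
print(len(per_review_items))
```

This is why the accepted fix is to build and yield the item inside the per-review loop rather than once per page.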
Solution

I started from scratch and the following spider should be run with

```
scrapy crawl amazon -t csv -o Amazon.csv --loglevel=INFO
```

so that opening the CSV file with a spreadsheet shows, for me, the expected output.

Hope this helps :-)

```python
import scrapy

class AmazonItem(scrapy.Item):
    rating = scrapy.Field()
    date = scrapy.Field()
    review = scrapy.Field()
    link = scrapy.Field()

class AmazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ['amazon.co.uk']
    start_urls = ['http://www.amazon.co.uk/product-reviews/B0042EU3A2/']

    def parse(self, response):
        # One item per review row, so the CSV gets one line per review.
        for sel in response.xpath('//table[@id="productReviews"]//tr/td/div'):
            item = AmazonItem()
            item['rating'] = sel.xpath('./div/span/span/span/text()').extract()
            item['date'] = sel.xpath('./div/span/nobr/text()').extract()
            item['review'] = sel.xpath('./div[@class="reviewText"]/text()').extract()
            item['link'] = sel.xpath('.//a[contains(.,"Permalink")]/@href').extract()
            yield item

        # Follow the "Next" pagination link, if present.
        xpath_Next_Page = './/table[@id="productReviews"]/following::*//span[@class="paging"]/a[contains(.,"Next")]/@href'
        if response.xpath(xpath_Next_Page):
            url_Next_Page = response.xpath(xpath_Next_Page).extract()[0]
            request = scrapy.Request(url_Next_Page, callback=self.parse)
            yield request
```
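Since the title asks about exporting the CSV in the right format, note that column order can be pinned outside the spider as well. In Scrapy itself the FEED_EXPORT_FIELDS setting (or a custom pipeline using CsvItemExporter) plays this role; the sketch below shows the same idea with only the standard csv module, using illustrative sample items:

```python
import csv
import io

# Hypothetical scraped items, one dict per review, as yielded by the spider above.
items = [
    {'rating': '5.0', 'date': '12 Jan 2015', 'review': 'Great', 'link': '/review/1'},
    {'rating': '2.0', 'date': '14 Jan 2015', 'review': 'Poor', 'link': '/review/2'},
]

# Fixing fieldnames pins the column order, much like FEED_EXPORT_FIELDS does.
fieldnames = ['rating', 'date', 'review', 'link']

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(items)  # one CSV row per review item

output = buffer.getvalue()
print(output)
```

With a header row plus one row per item, a spreadsheet opens the file with each review on its own line, which is the format the question asks for.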