Scrapy pipeline to export csv file in the right format

Views: 480  Published: 2017/2/24 19:58:16  Tags: python, csv, scrapy, pipeline


Problem description

I made the improvements suggested by alexce below. What I need is shown in the picture below; however, each row should be one review, with date, rating, review text and link.

I need the item processor to process each review on every page.

Currently TakeFirst() only takes the first review of each page, so for 10 pages I only get 10 rows, as in the picture below.



Spider code is below:
import scrapy
from amazon.items import AmazonItem

class AmazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ['amazon.co.uk']
    # Note: the URL has no '{}' placeholder, so .format(page) leaves it unchanged.
    start_urls = [
        'http://www.amazon.co.uk/product-reviews/B0042EU3A2/'.format(page)
        for page in xrange(1, 114)
    ]

    def parse(self, response):
        # Build one item per review cell found in the reviews table.
        for sel in response.xpath('//*[@id="productReviews"]//tr/td[1]'):
            item = AmazonItem()
            item['rating'] = sel.xpath('div/div[2]/span[1]/span/@title').extract()
            item['date'] = sel.xpath('div/div[2]/span[2]/nobr/text()').extract()
            item['review'] = sel.xpath('div/div[6]/text()').extract()
            item['link'] = sel.xpath('div/div[7]/div[2]/div/div[1]/span[3]/a/@href').extract()

            yield item
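
To illustrate why TakeFirst() can leave only one row per page, here is a minimal sketch. It assumes the processor was applied to values collected from a whole page rather than from a single review. The `take_first` function is a plain-Python stand-in that mimics the documented behaviour of Scrapy's TakeFirst (return the first value that is neither None nor an empty string); it is not Scrapy's own code, and the rating strings are made-up sample data.

```python
def take_first(values):
    """Stand-in for TakeFirst: first value that is not None and not ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Hypothetical ratings extracted from ONE page that holds three reviews.
page_ratings = [
    "5.0 out of 5 stars",
    "3.0 out of 5 stars",
    "4.0 out of 5 stars",
]

# Applied to the whole page: three reviews collapse into a single value,
# which is why the export ends up with one row per page.
row_per_page = take_first(page_ratings)

# Applied per review: every review keeps its own value, one row each.
rows_per_review = [take_first([rating]) for rating in page_ratings]

print(row_per_page)
print(len(rows_per_review))
```

Under that assumption, the fix is to create one item per review selector, so that any "take first" step only ever sees the values of a single review.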

Solution
I started from scratch and the following spider should be run with

scrapy crawl amazon -t csv -o Amazon.csv --loglevel=INFO

so that the resulting CSV file can be opened with a spreadsheet.

Hope this helps :-)
import scrapy

class AmazonItem(scrapy.Item):
    rating = scrapy.Field()
    date = scrapy.Field()
    review = scrapy.Field()
    link = scrapy.Field()

class AmazonSpider(scrapy.Spider):

    name = "amazon"
    allowed_domains = ['amazon.co.uk']
    start_urls = ['http://www.amazon.co.uk/product-reviews/B0042EU3A2/' ]

    def parse(self, response):

        for sel in response.xpath('//table[@id="productReviews"]//tr/td/div'):

            item = AmazonItem()
            item['rating'] = sel.xpath('./div/span/span/span/text()').extract()
            item['date'] = sel.xpath('./div/span/nobr/text()').extract()
            item['review'] = sel.xpath('./div[@class="reviewText"]/text()').extract()
            item['link'] = sel.xpath('.//a[contains(.,"Permalink")]/@href').extract()
            yield item

        xpath_Next_Page = './/table[@id="productReviews"]/following::*//span[@class="paging"]/a[contains(.,"Next")]/@href'
        if response.xpath(xpath_Next_Page):
            url_Next_Page = response.xpath(xpath_Next_Page).extract()[0]
            request = scrapy.Request(url_Next_Page, callback=self.parse)
            yield request
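
Because `.extract()` returns a list for every field, each CSV cell may hold several text fragments. One possible way to get a single clean string per cell is an item pipeline that joins the fragments. The sketch below is an assumption-laden illustration, not part of the original answer: `FlattenFieldsPipeline` and `rows_to_csv` are hypothetical names, the sample item is made-up data, and the CSV is written with the standard library `csv` module only so the result can be inspected without running a crawl.

```python
import csv
import io


class FlattenFieldsPipeline:
    """Hypothetical pipeline: join list-valued fields into one string each."""

    def process_item(self, item, spider):
        for key in list(item.keys()):
            value = item[key]
            if isinstance(value, (list, tuple)):
                # Strip each fragment and drop the ones that are only whitespace.
                parts = [part.strip() for part in value if part.strip()]
                item[key] = " ".join(parts)
        return item


def rows_to_csv(rows, fieldnames):
    """Write dict-like rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up item shaped like the spider's output (every field is a list).
sample = {
    "date": ["1 January 2015"],
    "rating": ["5.0 out of 5 stars"],
    "review": ["  Works well. ", "\n", "Would buy again."],
    "link": ["http://example.com/review/1"],
}

flat = FlattenFieldsPipeline().process_item(dict(sample), spider=None)
csv_text = rows_to_csv([flat], ["date", "rating", "review", "link"])
print(csv_text)
```

In a real project such a class would be enabled through the `ITEM_PIPELINES` setting (the module path depends on the project layout), and the column order of the built-in CSV feed can be fixed with the `FEED_EXPORT_FIELDS` setting instead of the helper function used here.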


                        