Scrapy using CSS to extract data and excel export everything into one cell


Question

Here is the spider:

import scrapy
import re

from ..items import HomedepotSpiderItem



class HomedepotcrawlSpider(scrapy.Spider):
    name = 'homeDepotCrawl'
    allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/ZLINE-Kitchen-and-Bath/N-5yc1vZhsy/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&storeSelection=3304,3313,3311,3310,8560&experienceName=default']



    def parse(self, response):

        items = HomedepotSpiderItem()

        #get model
        productName = response.css('.pod-plp__description.js-podclick-analytics').css('::text').getall()

        productName = [x.strip(' ') for x in productName if len(x.strip())] 
        productName = [x.strip('\n') for x in productName if len(x.strip())] 
        productName = [x.strip('\t') for x in productName if len(x.strip())] 
        productName = [x.strip(',') for x in productName if len(x.strip())] 

        #productName = productName[0].split(',') # tried to split the list into individual elements


        productSKU = response.css('.pod-plp__model::text').getall()

        #get rid of all the stuff I don't need
        productSKU = [x.strip(' ') for x in productSKU] #whitespace
        productSKU = [x.strip('\n') for x in productSKU] 
        productSKU = [x.strip('\t') for x in productSKU] 
        productSKU = [x.strip(' Model# ') for x in productSKU] #gets rid of the "Model#" label
        productSKU = [x.strip('\xa0') for x in productSKU] #gets rid of non-breaking spaces


        #get the price
        productPrice = response.css('.price__numbers::text').getall()

        #get rid of all the stuff I don't need
        productPrice = [x.strip(' ') for x in productPrice if len(x.strip())] 
        productPrice = [x.strip('\n') for x in productPrice if len(x.strip())] 
        productPrice = [x.strip('\t') for x in productPrice if len(x.strip())] 
        productPrice = [x.strip('$') for x in productPrice if len(x.strip())] 

        ## All prices are printing out twice, so take every other price
        productPrice = productPrice[::2]



        items['productName'] = productName
        items['productSKU'] = productSKU
        items['productPrice'] = productPrice

        yield items

Items.py

import scrapy


class HomedepotSpiderItem(scrapy.Item):
     #create items
     productName = scrapy.Field()
     productSKU = scrapy.Field()
     productPrice = scrapy.Field()
     #prodcutNumRating = scrapy.Field()

     pass

My Problem

I'm doing some practice with Scrapy right now, and I extracted all of this data from Home Depot's website using CSS. After extracting it, I manually stripped out all the data I didn't need, and it looked fine on the terminal. However, after exporting everything to excel, all of my extracted data ends up in one cell per field. Ex: Product Name -> all models going into one cell. I looked at some Scrapy documentation and saw that .getall() returns everything as a list, so I tried splitting the list into individual elements, thinking that would work; however, that got rid of all the data I scraped.


Any help would be appreciated and let me know if there is any clarification that is needed!


Edit: I'm exporting to excel using: scrapy crawl homeDepotCrawl -o test.csv -t csv

Answer


The problem is you are loading all items into one scrapy.Item instance. See code comments for more details.
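The one-cell symptom can be reproduced without Scrapy at all, using the standard csv module: a single dict whose values are lists serializes into one row, while one dict per product gives one row per product. (Scrapy's CSV exporter behaves analogously; the product data below is invented for illustration.)

```python
import csv
import io

products = [("Range Hood", "ZLKB123"), ("Wall Oven", "ZLWO456")]

def rows_to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["productName", "productSKU"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# One item holding lists -> a single data row: every field crammed into one cell
single = rows_to_csv([{"productName": [n for n, _ in products],
                       "productSKU": [s for _, s in products]}])

# One item per product -> one data row per product
per_item = rows_to_csv([{"productName": n, "productSKU": s}
                        for n, s in products])
```

Here `single` has one header line plus one data line, while `per_item` has one data line per product, which is what the answer below achieves by yielding inside the loop.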


Also, it is worth noting that you can use item loaders or create an item pipeline to clean the fields instead of repeating so much code. When dealing with a single item you will not need to use so much list comprehension. Even a simple function you can call to run them through would be better than doing all this list comprehension.

[1] https://docs.scrapy.org/en/latest/topics/loaders.html

[2] https://docs.scrapy.org/en/latest/topics/item-pipeline.html

[3] https://docs.scrapy.org/en/latest/topics/items.html
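As a sketch of the "simple function" idea above (the helper name and strip_chars default are made up, not part of the original answer), one small function can replace the chains of list comprehensions in the spider:

```python
def clean(values, strip_chars=" \n\t,$\xa0"):
    """Hypothetical helper: strip unwanted leading/trailing characters,
    drop the "Model#" label, and discard empty strings.

    Adjust strip_chars per field, e.g. "$" only matters for prices and
    "\xa0" (non-breaking space) shows up in the Model# strings.
    """
    cleaned = []
    for value in values:
        value = value.replace("Model#", "").strip(strip_chars)
        if value:
            cleaned.append(value)
    return cleaned

print(clean([" $1,299.00 \n", "  \n"]))  # → ['1,299.00']
print(clean(["\xa0Model# KB-30 "]))      # → ['KB-30']
```

The same logic could live in a scrapy.ItemLoader input processor or an item pipeline's process_item, per the links above.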

import scrapy
import re

from ..items import HomedepotSpiderItem

class HomedepotcrawlSpider(scrapy.Spider):
    name = 'homeDepotCrawl'
    allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/ZLINE-Kitchen-and-Bath/N-5yc1vZhsy/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&storeSelection=3304,3313,3311,3310,8560&experienceName=default']


    def parse(self, response):
        '''
        Notice that when we set the items variable we are not calling .get or .extract yet.
        We collect the top-level selector for each item into a list of selectors,
        then loop through the selectors, creating a new scrapy.Item instance for each
        selector/item on the page. The `for product in items` loop steps through each
        item selector individually. You can then chain .css onto product to access
        each section of each item individually and export them separately.
        This gives you a new row for each item.
        '''
        items = response.css('.plp-pod')
        for product in items:
            # Create a new scrapy.Item for each product in our selector list.
            item = HomedepotSpiderItem()
            item['productName'] = product.css('.pod-plp__description.js-podclick-analytics::text').get()
            # Notice we are yielding item inside of the loop.
            yield item
