Scrapy using CSS to extract data and excel export everything into one cell

Problem Description

Here is the spider:
import scrapy
import re
from ..items import HomedepotSpiderItem


class HomedepotcrawlSpider(scrapy.Spider):
    name = 'homeDepotCrawl'
    allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/ZLINE-Kitchen-and-Bath/N-5yc1vZhsy/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&storeSelection=3304,3313,3311,3310,8560&experienceName=default']

    def parse(self, response):
        items = HomedepotSpiderItem()

        # get model
        productName = response.css('.pod-plp__description.js-podclick-analytics').css('::text').getall()
        productName = [x.strip(' ') for x in productName if len(x.strip())]
        productName = [x.strip('\n') for x in productName if len(x.strip())]
        productName = [x.strip('\t') for x in productName if len(x.strip())]
        productName = [x.strip(',') for x in productName if len(x.strip())]
        # productName = productName[0].split(',')  # tried to split the list into individual elements

        productSKU = response.css('.pod-plp__model::text').getall()

        # get rid of all the stuff I don't need
        productSKU = [x.strip(' ') for x in productSKU]         # whitespace
        productSKU = [x.strip('\n') for x in productSKU]
        productSKU = [x.strip('\t') for x in productSKU]
        productSKU = [x.strip(' Model# ') for x in productSKU]  # gets rid of the "Model#" label
        productSKU = [x.strip('\xa0') for x in productSKU]      # gets rid of non-breaking spaces

        # get the price
        productPrice = response.css('.price__numbers::text').getall()

        # get rid of all the stuff I don't need
        productPrice = [x.strip(' ') for x in productPrice if len(x.strip())]
        productPrice = [x.strip('\n') for x in productPrice if len(x.strip())]
        productPrice = [x.strip('\t') for x in productPrice if len(x.strip())]
        productPrice = [x.strip('$') for x in productPrice if len(x.strip())]

        # All prices are printing out twice, so take every other price
        productPrice = productPrice[::2]

        items['productName'] = productName
        items['productSKU'] = productSKU
        items['productPrice'] = productPrice
        yield items
items.py
import scrapy


class HomedepotSpiderItem(scrapy.Item):
    # create items
    productName = scrapy.Field()
    productSKU = scrapy.Field()
    productPrice = scrapy.Field()
    # productNumRating = scrapy.Field()
My Question

I'm doing some practice with Scrapy right now, and I extracted all of this data from Home Depot's website using CSS. After extracting, I manually stripped out all the data I didn't need, and it looked fine on the terminal. However, after exporting everything to Excel, all my extracted data prints out into one column per row. Ex: Product Name -> all models going into one cell. I looked into some Scrapy documentation and saw that .getall() returns everything as a list, so I tried splitting the list into individual elements, thinking that would fix it; however, that got rid of all the data I scraped.

Any help would be appreciated, and let me know if any clarification is needed!
Edit

I'm exporting to Excel using: scrapy crawl homeDepotCrawl -o test.csv -t csv
Answer

The problem is that you are loading all of the items into one scrapy.Item instance. See the code comments below for more details.
Also, it is worth noting that you can use item loaders or create an item pipeline to clean the fields instead of repeating so much code. When dealing with a single item you will not need so many list comprehensions; even a simple function you run each value through would be better than all of that repetition.
[1] https://docs.scrapy.org/en/latest/topics/loaders.html
[2] https://docs.scrapy.org/en/latest/topics/item-pipeline.html
[3] https://docs.scrapy.org/en/latest/topics/items.html
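As a concrete illustration of the "simple function" idea, all of the repeated stripping in the question could be collapsed into one helper. This is a minimal sketch, not code from the original post; the helper names (clean, clean_all) are my own, and the characters stripped are taken from the question's list comprehensions:

```python
def clean(value):
    """Normalize one scraped text fragment."""
    value = value.replace('\xa0', ' ')   # non-breaking spaces
    value = value.replace('Model#', '')  # label in front of the SKU
    return value.strip(' \n\t,$')        # surrounding whitespace and junk

def clean_all(values):
    """Clean every fragment from .getall() and drop the empty ones."""
    cleaned = (clean(v) for v in values)
    return [v for v in cleaned if v]

print(clean_all(['\n  Model#\xa0ZLINE-RT30  ', '\n\t', '$1,099.00\n']))
# → ['ZLINE-RT30', '1,099.00']
```

In the spider, each field then becomes a single call, e.g. clean_all(response.css('.price__numbers::text').getall()), instead of four comprehensions per field.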
import scrapy
from ..items import HomedepotSpiderItem


class HomedepotcrawlSpider(scrapy.Spider):
    name = 'homeDepotCrawl'
    allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/ZLINE-Kitchen-and-Bath/N-5yc1vZhsy/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&storeSelection=3304,3313,3311,3310,8560&experienceName=default']

    def parse(self, response):
        '''
        Notice that when we set the items variable we are not using .get or
        .extract yet. We collect the top level of each item into a list of
        selectors, then loop through the selectors, creating a new
        scrapy.Item instance for each selector/item on the page. The
        "for product in items" loop steps through each item selector
        individually. You can then chain .css onto product to access each
        section of each item individually and export them separately.
        This gives you a new row for each item.
        '''
        items = response.css('.plp-pod')
        for product in items:
            # Create a new scrapy.Item for each product in our selector list.
            item = HomedepotSpiderItem()
            item['productName'] = product.css('.pod-plp__description.js-podclick-analytics::text').get()
            # Notice we are yielding item inside of the loop.
            yield item
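To see why the original spider collapsed everything into single cells: the CSV exporter writes one row per yielded item, so a single item whose fields are lists becomes one row with an entire list in each cell. A quick stdlib sketch (plain csv, not Scrapy, with made-up product data) shows the difference between yielding once and yielding per product:

```python
import csv
import io

def to_csv(rows):
    # Serialize dicts the way a CSV exporter would: one row per dict.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=['productName', 'productSKU'])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# One item holding lists (the question's approach): a single data row,
# with every product crammed into each cell.
one_item = [{'productName': ['Range A', 'Range B'], 'productSKU': ['RA1', 'RB2']}]

# One item per product (the answer's approach): one data row per product.
per_item = [{'productName': 'Range A', 'productSKU': 'RA1'},
            {'productName': 'Range B', 'productSKU': 'RB2'}]

print(len(to_csv(one_item).splitlines()))  # → 2 (header + one combined row)
print(len(to_csv(per_item).splitlines()))  # → 3 (header + one row per product)
```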