How to append items from scrapy spider to list?

Problem description

I'm using a basic spider that gets particular information from links on a website. My code looks like this:

import sys
import scrapy
from scrapy import Request
import urllib.parse as urlparse
from properties import PropertiesItem, ItemLoader
from scrapy.crawler import CrawlerProcess

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = ['www.example.com']
    objectList = []
    def parse(self, response):
        # Get item URLs and yield Requests
        item_selector = response.xpath('//*[@class="example"]//@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin(response.url, url), callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        L = ItemLoader(item=PropertiesItem(), response=response)
        L.add_xpath('title', '//*[@class="example"]/text()')
        L.add_xpath('adress', '//*[@class="example"]/text()')
        return L.load_item()

process = CrawlerProcess()
process.crawl(BasicSpider)
process.start()

What I want now is to append every class instance "L" to a list called objectList. I've tried do to so by altering the code like:

    def parse_item(self, response):
        global objectList
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('title', '//*[@class="restaurantSummary-name"]/text()')
        l.add_xpath('adress', '//*[@class="restaurantSummary-address"]/text()')
        item = l.load_item()
        objectList.append([item.title, item.adress])
        return objectList       

But when I run this code I get a message saying:

l = ItemLoader(item=PropertiesItem(), response=response)
NameError: name 'PropertiesItem' is not defined

Q: How do I append every item that the scraper finds to the list objectList?

I want to store the results in a list, because I can then save the results like this:

import pandas as pd
table = pd.DataFrame(objectList)   
writer = pd.ExcelWriter('DataAll.xlsx')
table.to_excel(writer, 'sheet 1')
writer.save()

Answer

To save results you should use Scrapy's Feed Exports feature, as described in the documentation.

One of the most frequently required features when implementing scrapers is being able to store the scraped data properly and, quite often, that means generating an "export file" with the scraped data (commonly called "export feed") to be consumed by other systems.

Scrapy provides this functionality out of the box with the Feed Exports, which allows you to generate a feed with the scraped items, using multiple serialization formats and storage backends.

See the CSV section there for your case.

Another, more custom, approach would be using scrapy's Item Pipelines. The documentation includes an example of a simple JSON writer pipeline that could easily be modified to output CSV or any other format.
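For illustration, a minimal CSV-writing pipeline might look like the sketch below (the class name, output file name, and module path are invented for the example; adapt them to your project). A pipeline receives every item the spider returns, so it is also the natural place to hook in if you really do want to collect items yourself.

import csv

class CsvWriterPipeline(object):
    """Hypothetical pipeline that writes every scraped item to items.csv."""

    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open('items.csv', 'w', newline='', encoding='utf-8')
        self.writer = None

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        row = dict(item)
        # Create the writer lazily so the header matches the first item's fields
        if self.writer is None:
            self.writer = csv.DictWriter(self.file, fieldnames=list(row.keys()))
            self.writer.writeheader()
        self.writer.writerow(row)
        return item

The pipeline is enabled through the ITEM_PIPELINES setting, e.g. ITEM_PIPELINES = {'myproject.pipelines.CsvWriterPipeline': 300} in settings.py (the module path 'myproject.pipelines' is an assumption about your project layout).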

Returning to the feed exports approach: for example, this piece of code would output all items to a test.csv file in the project directory:

import scrapy

class MySpider(scrapy.Spider):
    name = 'feed_exporter_test'
    # this is equivalent to what you would set in settings.py file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'test.csv'
    }
    start_urls = ['http://stackoverflow.com/questions/tagged/scrapy']

    def parse(self, response):
        titles = response.xpath("//a[@class='question-hyperlink']/text()").extract()
        for i, title in enumerate(titles):
            yield {'index': i, 'title': title}

This example generates a 50-row-long CSV file.
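Since you already drive your spider with CrawlerProcess, you can keep doing that; once the crawl finishes, the exported test.csv can be read back with pandas and written to Excel, which is what your question is ultimately after. A minimal sketch, assuming the MySpider class above and the DataAll.xlsx name from your own snippet:

from scrapy.crawler import CrawlerProcess
import pandas as pd

process = CrawlerProcess()
process.crawl(MySpider)
process.start()  # blocks until the crawl is done and test.csv has been written

# Load the exported feed and save it as an Excel workbook
table = pd.read_csv('test.csv')
table.to_excel('DataAll.xlsx', sheet_name='sheet 1', index=False)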
