How can I instruct Scrapy to not serialize an item field?

Question

As a learning experiment to get familiar with Scrapy, I'm writing a scraper that checks all the links on an HTML page and reports the status codes of HTTP HEAD requests sent to them. In one of my item definitions I have a field, parent_url, that I treat as metadata: I don't want it to appear in my scraper's output.

parent_url is defined in the LinkItem class as follows:

from scrapy import Item, Field

class LinkItem(Item):
    name = Field()
    url = Field()
    parent_url = Field()   # Identifies what URL this item was extracted from
    status_code = Field()

To omit parent_url from my spider's output, I've tried:

  1. Defining parent_url in __init__ as an instance attribute. Exceptions were raised when I tried to access it;
  2. Assigning to self["parent_url"] inside __init__, but as the documentation already notes, Scrapy doesn't allow assigning to undeclared fields;
  3. Assigning Field(serializer=None) or Field(serializer=empty_function) to parent_url, which produced continuous exceptions while scraping and a JSON output containing only commas.

I haven't found a solution yet, so I'm looking for outside help. The parent_url field/attribute is used internally within a pipeline, and I don't know what else I could replace it with.

Recommended answer

You can specify which fields should be exported with the FEED_EXPORT_FIELDS setting. Fields left out of that list, such as parent_url, are omitted from the feed while still being available on the item inside pipelines. For example:

# in `settings.py`
FEED_EXPORT_FIELDS = ['name', 'url', 'status_code']
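
To illustrate the effect of that setting, here is a minimal sketch in plain Python. It is not Scrapy's exporter code: the helper select_export_fields and the sample item values are hypothetical, and a plain dict stands in for a LinkItem. It only shows the idea that the listed fields are kept, in the listed order, while parent_url stays on the item for internal use.

```python
import json

# Same list as in settings.py above; here it is only a local variable.
FEED_EXPORT_FIELDS = ["name", "url", "status_code"]


def select_export_fields(item, fields):
    """Hypothetical helper: keep only the listed fields, in the listed order."""
    return {key: item[key] for key in fields if key in item}


# A plain dict standing in for a LinkItem; the values are made up.
item = {
    "name": "Example link",
    "url": "http://example.com/a",
    "parent_url": "http://example.com/",
    "status_code": 200,
}

# What a feed would roughly contain: parent_url is not part of it.
exported = select_export_fields(item, FEED_EXPORT_FIELDS)
print(json.dumps(exported))

# The original item is untouched, so a pipeline can still read parent_url.
print(item["parent_url"])
```

If different spiders in one project need different output columns, the same setting can presumably be set per spider through its custom_settings attribute instead of project-wide in settings.py.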
