scrapy doesn't add fields which are not present in all items?


Problem description


I get fields a,b,c from a link and yield an OrderedDict. But if a condition is met I don't yield yet; first I make a request to another link, pass the a,b,c dictionary to that request (through partial), get fields d,e from the second link, and yield d,e,a,b,c.


So some items should have the fields d,e,a,b,c and some items should have just a,b,c


When I print the OrderedDicts they are correct:

the second OrderedDict has keys d,e,a,b,c
the first OrderedDict has keys a,b,c


But in the exported .csv file I only see a,b,c columns.


So my question is: is scrapy not showing fields which are not present in all items?


Note: by field I just mean column header; I do NOT use scrapy's Item and Field classes, I just use OrderedDict.


Update: I've managed to solve my problem by yielding a single dict (updating the first dict). But I am still curious about the question above.
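
For reference, here is a minimal sketch of the pattern described in the question. The spider name, URLs and CSS selectors are made up for illustration; only the partial-callback idea comes from the question itself:

    from collections import OrderedDict
    from functools import partial

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["https://example.com/listing"]  # hypothetical URL

        def parse(self, response):
            item = OrderedDict()
            item["a"] = response.css("h1::text").get()
            item["b"] = response.css(".b::text").get()
            item["c"] = response.css(".c::text").get()

            detail_url = response.css("a.detail::attr(href)").get()
            if detail_url:  # the "condition" from the question
                # pass the partly built dict on to the second callback
                yield response.follow(
                    detail_url, callback=partial(self.parse_detail, item=item))
            else:
                yield item  # only a, b, c

        def parse_detail(self, response, item):
            extra = OrderedDict()
            extra["d"] = response.css(".d::text").get()
            extra["e"] = response.css(".e::text").get()
            extra.update(item)
            yield extra  # d, e, a, b, c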

Answer


Let's first have a quick look at related source code in scrapy.exporters.CsvItemExporter:

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))
        self.csv_writer.writerow(values)

    def _write_headers_and_set_fields_to_export(self, item):
        if self.include_headers_line:
            if not self.fields_to_export:
                if isinstance(item, dict):
                    # for dicts try using fields of the first item
                    self.fields_to_export = list(item.keys())
                else:
                    # use fields declared in Item
                    self.fields_to_export = list(item.fields.keys())
            row = list(self._build_row(self.fields_to_export))
            self.csv_writer.writerow(row)


The exporter itself deals with streaming data, which means it cannot buffer all spider outputs before writing to the files. Thus the CSV exporter has to infer the headers from only the 1st item.


If you're using scrapy.Item, there should be no problem at all. Otherwise, if you're using a plain Python dict, the field names of the first item are used as the CSV headers.
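
A minimal sketch of the Item-based fix, with the field names taken from the question: declaring every possible field up front lets the exporter emit all column headers even when the first item lacks d and e.

    import scrapy


    class PageItem(scrapy.Item):
        # all possible fields declared up front; items that don't set
        # d or e are still exported, with empty cells in those columns
        a = scrapy.Field()
        b = scrapy.Field()
        c = scrapy.Field()
        d = scrapy.Field()
        e = scrapy.Field()

Alternatively, if you want to keep yielding plain dicts, you can pin the column list yourself with the FEED_EXPORT_FIELDS setting (for example FEED_EXPORT_FIELDS = ["d", "e", "a", "b", "c"] in settings.py), which populates fields_to_export before the first item arrives.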
