Formatting Scrapy's output to XML

Published: 2021/7/16 22:06:51 Tags: python xml web-scraping web-crawler scrapy


Question

I am scraping a website with Scrapy and want the data to be in a particular format when I export it to XML.

Here is what I would like my XML to look like:

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <row>
    <field1><![CDATA[Data Here]]></field1>
    <field2><![CDATA[Data Here]]></field2>
  </row>
</data>

I run the scrape with this command:

$ scrapy crawl my_scrap -o items.xml -t xml

The output I currently get is:

<?xml version="1.0" encoding="utf-8"?>
<items><item><field1><value>Data Here</value></field1><field2><value>Data Here</value></field2></item>

As you can see, it adds <value> elements inside each field, and I am not able to rename the root node or the item nodes. I know that I need to use XmlItemExporter, but I am not sure how to implement it in my project.

I have tried adding it to pipelines.py as shown in the linked example, but I always end up with this error:

AttributeError: 'CrawlerProcess' object has no attribute 'signals'

Does anybody know of examples showing how to reformat the data when exporting it to XML with XmlItemExporter?

Here is the XmlItemExporter usage in my pipelines.py module:

from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter

class XmlExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Edit (showing the modification and the traceback):

I modified the spider_opened function:

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file, 'data', 'row')
        self.exporter.start_exporting()

The traceback I get is:

Traceback (most recent call last):
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/core/engine.py", line 265, in <lambda>
            spider=spider, reason=reason, spider_stats=self.crawler.stats.get_stats()))
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
            return signal.send_catch_log_deferred(*a, **kw)
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
            *arguments, **named)
        --- <exception caught here> ---
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 134, in maybeDeferred
            result = f(*args, **kw)
          File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
            return receiver(*arguments, **named)
          File "/root/self_opportunity/self_opportunity/pipelines.py", line 28, in spider_closed
            self.exporter.finish_exporting()
        exceptions.AttributeError: 'XmlExportPipeline' object has no attribute 'exporter'

Recommended answer

You can make XmlItemExporter do most of what you want simply by supplying the names of the nodes you want:

XmlItemExporter(file, root_element='data', item_element='row')

See the documentation.

The <value> elements appear in your fields because those fields are not scalar values. If XmlItemExporter encounters a scalar value, it simply outputs <fieldname>data</fieldname>, but if it encounters an iterable value, it serializes it like this: <fieldname><value>data1</value><value>data2</value></fieldname>. The solution is to stop emitting non-scalar field values for your items.
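The difference between the two shapes can be reproduced without Scrapy. The sketch below is a simplified imitation of the behaviour described above, built on the standard library's XMLGenerator; it is not Scrapy's own code, and export_field, render, and first_or_none are hypothetical names chosen for illustration. It renders the same item twice: once with list values, once after collapsing each list to its first element.

```python
import io
from xml.sax.saxutils import XMLGenerator


def export_field(xg, name, value):
    """Write one field: plain text for scalars, nested <value> elements for lists."""
    xg.startElement(name, {})
    if isinstance(value, (list, tuple)):
        for element in value:
            export_field(xg, "value", element)
    else:
        xg.characters(str(value))
    xg.endElement(name)


def render(item):
    """Serialize a dict of fields as a single <row> element and return the text."""
    buffer = io.StringIO()
    xg = XMLGenerator(buffer)
    xg.startElement("row", {})
    for name, value in item.items():
        export_field(xg, name, value)
    xg.endElement("row")
    return buffer.getvalue()


def first_or_none(value):
    """Collapse a list of extracted values to its first element."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


scraped = {"field1": ["Data Here"], "field2": ["Data Here"]}
print(render(scraped))
# <row><field1><value>Data Here</value></field1><field2><value>Data Here</value></field2></row>

flattened = {name: first_or_none(value) for name, value in scraped.items()}
print(render(flattened))
# <row><field1>Data Here</field1><field2>Data Here</field2></row>
```

In a spider, list values usually come from selector calls that return every match, such as .extract() or .getall(). Calling .get() (or .extract_first() in older code) returns a single string instead, which avoids the need for a helper like first_or_none.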

If you aren't willing to do this, subclass XmlItemExporter and override its _export_xml_field method to do what you want when the item value is iterable. The XmlItemExporter source code shows how the default implementation works.
