Formatting Scrapy's output to XML
Question
I am scraping a website with Scrapy and want the data to be in a particular format when I export it to XML.
Here is what I would like my XML to look like:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<row>
<field1><![CDATA[Data Here]]></field1>
<field2><![CDATA[Data Here]]></field2>
</row>
</data>
I run my scrape with this command:
$ scrapy crawl my_scrap -o items.xml -t xml
The current output I get is:
<?xml version="1.0" encoding="utf-8"?>
<items><item><field1><value>Data Here</value></field1><field2><value>Data Here</value></field2></item>
As you can see, it adds the <value> elements, and I am not able to rename the root node or the item nodes. I know that I need to use XmlItemExporter, but I am not sure how to implement it in my project.
I have tried adding it to pipelines.py as shown here, but I always end up with the error:
AttributeError: 'CrawlerProcess' object has no attribute 'signals'
Does anybody know of examples of how to reformat the data when exporting it to XML using XmlItemExporter?
Here is my XmlItemExporter in my pipelines.py module:
from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter

class XmlExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Edit (showing the modification and the traceback):
I modified the spider_opened function:
    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file, 'data', 'row')
        self.exporter.start_exporting()
The traceback I get is:
Traceback (most recent call last):
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/core/engine.py", line 265, in <lambda>
    spider=spider, reason=reason, spider_stats=self.crawler.stats.get_stats()))
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
    return signal.send_catch_log_deferred(*a, **kw)
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
    *arguments, **named)
--- <exception caught here> ---
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/twisted/internet/defer.py", line 134, in maybeDeferred
    result = f(*args, **kw)
  File "/root/self_opportunity/venv/lib/python2.6/site-packages/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
    return receiver(*arguments, **named)
  File "/root/self_opportunity/self_opportunity/pipelines.py", line 28, in spider_closed
    self.exporter.finish_exporting()
exceptions.AttributeError: 'XmlExportPipeline' object has no attribute 'exporter'
Recommended answer
You can make XmlItemExporter do most of what you want simply by supplying the names of the nodes you want:
XmlItemExporter(file, root_element='data', item_element='row')
See the documentation.
The <value> elements appear in your fields because those fields are not scalar values. If XmlItemExporter encounters a scalar value, it simply outputs <fieldname>data</fieldname>, but if it encounters an iterable value, it serializes it like this: <fieldname><value>data1</value><value>data2</value></fieldname>. The solution is to stop emitting non-scalar field values for your items.
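One way to stop emitting non-scalar values is to collapse each list into a single string before the item reaches the exporter. The helpers below are hypothetical names introduced for illustration, not part of Scrapy. They assume the list values are strings, as is typical for extracted selector results.

```python
def to_scalar(value, separator=" "):
    """Collapse a list or tuple of strings into one stripped string.

    Scalars (str, int, None, ...) are returned unchanged.
    """
    if isinstance(value, (list, tuple)):
        return separator.join(str(v).strip() for v in value if v is not None)
    return value


def flatten_item(item):
    """Return a plain dict copy of ``item`` whose values are all scalars."""
    return {name: to_scalar(value) for name, value in dict(item).items()}


row = flatten_item({"field1": ["Data Here"], "field2": [" a ", "b"]})
print(row)
```

Calling `flatten_item` at the start of `process_item`, or in the spider before yielding the item, would be one place to apply it. Where only the first match matters, taking a single value in the spider (for instance with the selector's `.get()` method in recent Scrapy versions) avoids the list altogether.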
If you aren't willing to do this, subclass XmlItemExporter and override its _export_xml_field method to do what you want when the item value is iterable. This is the code for XmlItemExporter so you can see the implementation.