Retrieve only a portion of an XML feed


Question

I'm using Scrapy's XMLFeedSpider to parse a large (60 MB) XML feed from a website, and I was wondering whether there is a way to retrieve only a portion of it instead of all 60 MB, because the RAM consumption is currently quite high. Perhaps something could be added to the link, like:

"http://site/feed.xml?limit=10",我'我搜索过是否有类似的东西,但我没有找到任何东西.

"http://site/feed.xml?limit=10", i've searched if there is something similar to this but i haven't found anything.

Another option would be to limit the items parsed by Scrapy, but I don't know how to do that. Right now, once XMLFeedSpider has parsed the whole document, the bot analyzes only the first ten items, but I suppose the whole feed is still in memory. Do you have any idea how to improve the bot's performance and reduce its RAM and CPU consumption? Thanks.

Answer

When you are processing large XML documents and don't want to load the whole thing into memory as DOM parsers do, you need to switch to a SAX parser.

SAX parsers have some benefits over DOM-style parsers. A SAX parser only needs to report each parsing event as it happens, and normally discards almost all of that information once it has been reported (it does, however, keep some things, such as a list of all elements that have not been closed yet, in order to catch later errors such as end tags in the wrong order). Thus, the minimum memory required for a SAX parser is proportional to the maximum depth of the XML file (i.e., of the XML tree) and the maximum data involved in a single XML event (such as the name and attributes of a single start tag, or the content of a processing instruction).

For a 60 MB XML document, this is likely to be very low compared with the requirements for creating a DOM. Most DOM-based systems actually use a SAX-style parser at a much lower level to build up the tree.

To make use of SAX, subclass xml.sax.saxutils.XMLGenerator and override endElement, startElement and characters. Then call xml.sax.parse with it. I'm sorry I don't have a detailed example at hand to share with you, but I'm sure you will find plenty online.
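A minimal sketch of the approach described above: subclass xml.sax.saxutils.XMLGenerator, override startElement, endElement and characters, and hand the handler to xml.sax. The element name "item", the limit of ten, and the trick of raising a SAXException to stop parsing early are assumptions for illustration, not part of the original answer; a real feed would use its own tag names.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class LimitedItemHandler(XMLGenerator):
    """Collect the text of the first `limit` <item> elements, then stop.

    The tag name "item" and the early-abort-via-exception pattern are
    illustrative assumptions, not taken from the original answer.
    """

    def __init__(self, limit=10):
        # Direct XMLGenerator's own output into a throwaway buffer so
        # nothing is echoed to stdout while we parse.
        super().__init__(out=io.StringIO())
        self.limit = limit
        self.items = []       # texts of the collected <item> elements
        self._buffer = None   # text fragments while inside an <item>

    def startElement(self, name, attrs):
        if name == "item" and len(self.items) < self.limit:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "item" and self._buffer is not None:
            self.items.append("".join(self._buffer).strip())
            self._buffer = None
            if len(self.items) >= self.limit:
                # Enough items collected: abort the parse so the rest of
                # the feed is never read into memory.
                raise xml.sax.SAXException("item limit reached")


# A synthetic 100-item feed standing in for the real 60 MB document.
feed = "<feed>" + "".join(f"<item>item {i}</item>" for i in range(100)) + "</feed>"

handler = LimitedItemHandler(limit=10)
try:
    xml.sax.parseString(feed.encode("utf-8"), handler)
except xml.sax.SAXException:
    pass  # expected once the limit is hit

print(len(handler.items))
```

Because SAX delivers events incrementally, only the ten collected item strings are kept; aborting the parse once the limit is reached also avoids the CPU cost of scanning the remainder of the feed.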

