How to efficiently detect an XML schema without having the entire file in Python

Question

I have a very large feed file that is sent as an XML document (5GB). What would be the fastest way to parse the structure of the main item node without previously knowing its structure? Is there a means in Python to do so 'on-the-fly' without having the complete XML loaded in memory? For example, what if I just saved the first 5MB of the file (by itself it would be invalid XML, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?

Update: I've included an example XML fragment here: https://hastebin.com/uyalicihow.xml. I'm looking to extract something like a dataframe (or list or whatever other data structure you want to use) similar to the following:

Items/Item/Main/Platform       Items/Item/Info/Name
iTunes                         Chuck Versus First Class
iTunes                         Chuck Versus Bo

How could this be done? I've added a bounty to encourage answers here.

Recommended answer

Several people have misinterpreted this question, and on re-reading it, it's really not at all clear. In fact there are several questions.

How to detect an XML schema

Some people have interpreted this as saying you think there might be a schema within the file, or referenced from the file. I interpreted it as meaning that you want to infer a schema from the content of the instance.

What would be the fastest way to parse the structure of the main item node without previously knowing its structure?

Just put it through a parser, e.g. a SAX parser. A parser doesn't need to know the structure of an XML file in order to split it up into elements and attributes. But I don't think you actually want the fastest parse possible (in fact, I don't think performance is that high on your requirements list at all). I think you want to do something useful with the information (you haven't told us what): that is, you want to process the information, rather than just parse the XML.
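As an illustration of the point that a parser needs no prior knowledge of the structure, here is a minimal sketch using the standard library's `xml.sax`. It counts every distinct element path it encounters, which is one simple way to infer the shape of the main item node. The `PathCollector` and `collect_paths` names and the sample document are hypothetical; the element names are modeled on the column headings in the question, not on the real feed.

```python
import io
import xml.sax
from collections import Counter


class PathCollector(xml.sax.ContentHandler):
    """Count every distinct element path seen in the stream."""

    def __init__(self):
        super().__init__()
        self.stack = []         # names of the currently open elements
        self.paths = Counter()  # "A/B/C" path -> number of occurrences

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths["/".join(self.stack)] += 1

    def endElement(self, name):
        self.stack.pop()


def collect_paths(source):
    """Stream `source` (a filename or binary file object) through SAX."""
    handler = PathCollector()
    xml.sax.parse(source, handler)
    return handler.paths


# Hypothetical sample, shaped like the headings in the question.
SAMPLE = (
    b"<Items>"
    b"<Item><Main><Platform>iTunes</Platform></Main>"
    b"<Info><Name>Chuck Versus First Class</Name></Info></Item>"
    b"<Item><Main><Platform>iTunes</Platform></Main>"
    b"<Info><Name>Chuck Versus Bo</Name></Info></Item>"
    b"</Items>"
)

if __name__ == "__main__":
    for path, count in collect_paths(io.BytesIO(SAMPLE)).items():
        print(count, path)
```

The handler keeps only the stack of open element names and the counter, so memory use depends on the number of distinct paths rather than on the size of the file.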

Is there a python utility that can do so 'on-the-fly' without having the complete xml loaded in memory?

Yes. This page lists three event-based XML parsers in the Python world: https://wiki.python.org/moin/PythonXml (I can't vouch for any of them).
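For the table the question asks for, one event-based option in the standard library is `xml.etree.ElementTree.iterparse`. The sketch below is an assumption about how it might be applied here, not something the answer itself prescribes: the `iter_items` helper, the `Item` tag, and the two field paths are hypothetical and taken from the column headings in the question.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(source, item_tag="Item", fields=("Main/Platform", "Info/Name")):
    """Yield one tuple of field values per item without building the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == item_tag:
            # The item is complete here, so its children can be queried.
            yield tuple(elem.findtext(path) for path in fields)
            # Drop the items already processed so memory stays flat.
            root.clear()


# Hypothetical sample, shaped like the headings in the question.
SAMPLE = (
    b"<Items>"
    b"<Item><Main><Platform>iTunes</Platform></Main>"
    b"<Info><Name>Chuck Versus First Class</Name></Info></Item>"
    b"<Item><Main><Platform>iTunes</Platform></Main>"
    b"<Info><Name>Chuck Versus Bo</Name></Info></Item>"
    b"</Items>"
)

if __name__ == "__main__":
    for row in iter_items(io.BytesIO(SAMPLE)):
        print(row)
```

Because the function is a generator, the rows can be passed straight to whatever consumes them (a list, a CSV writer, a dataframe constructor) while the file is still being read.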

what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?

I'm not sure you know what the verb "to parse" actually means. Your phrase certainly suggests that you expect the file to contain a schema, which you want to extract. But I'm not at all sure you really mean that. And in any case, if it did contain a schema in the first 5MB, you could find it just by reading the file sequentially; there would be no need to "save" the first part of the file first.
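Under the other reading of the question (inferring the structure from the instance), the same sequential-reading idea applies: read from the start and stop once enough items have been seen. The following is a rough sketch under that assumption. The `sample_paths` helper, the `Item` tag, and the truncated sample are hypothetical. Note that it swallows `ParseError` so that a cut-off prefix still returns what was collected, which also hides genuine malformation.

```python
import io
import xml.etree.ElementTree as ET


def sample_paths(source, max_items=1000, item_tag="Item"):
    """Collect element paths from the first `max_items` items, then stop reading."""
    paths = set()
    stack = []  # tags of the currently open elements
    seen = 0
    try:
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "start":
                stack.append(elem.tag)
                paths.add("/".join(stack))
            else:
                stack.pop()
                if elem.tag == item_tag:
                    seen += 1
                    elem.clear()  # free the children of the finished item
                    if seen >= max_items:
                        break  # enough items sampled; the rest is never read
    except ET.ParseError:
        # The input ended mid-document (for example a truncated prefix).
        # Caution: this also hides real syntax errors in the data.
        pass
    return sorted(paths)


# Hypothetical prefix of a feed, cut off in the middle of a tag.
TRUNCATED = (
    b"<Items>"
    b"<Item><Main><Platform>iTunes</Platform></Main>"
    b"<Info><Name>Chuck Versus First Class</Name></Info></Item>"
    b"<Item><Main><Platfo"
)

if __name__ == "__main__":
    for path in sample_paths(io.BytesIO(TRUNCATED)):
        print(path)
```

With an early `break`, there is no need to cut the file beforehand: passing the full 5GB file and a small `max_items` reads only as much as is needed to see that many items.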
