PHP - Read and repair big invalid XML files
Question
I have to read some fairly large XML files (between 200 MB and 1 GB), some of which are invalid. Here is a small example:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<item>
<title>Some article</title>
<g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>
</item>
</rss>
Obviously, the closing </ul> tag is missing inside the g:material element. Moreover, the people who developed this feed should have enclosed the g:material content in a CDATA section, which they did not. Basically, that is what I want to do: add the missing CDATA section.
I tried using a SAX parser to read this file, but it fails on the </g:material> tag because the </ul> tag is missing. I tried XMLReader and hit basically the same issue.
I could probably do something with DOMDocument::loadHTML, but the size of these files is not really compatible with a DOM approach.
Do you have any idea how I could simply repair this feed without having to buy lots of RAM for DOMDocument to work?
Thanks.
Recommended answer
If the files are too large for the Tidy extension, you can use the tidy CLI tool to make them parseable.
$ tidy -output my.clean.xml my.xml
After that, the XML files are well-formed, so you can parse them with XMLReader. Since tidy adds the 'missing' (X)HTML parts, your original document's code ends up inside the <body> element.
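Once the cleaned file is well-formed, the streaming step could look roughly like the sketch below. It is only an illustration: the function name streamItems is hypothetical, and it assumes the cleaned feed still contains item elements somewhere in the tree (wherever tidy has placed them). Only one item is held in memory at a time, so the file size should not matter much.

```php
<?php
// Minimal sketch: stream <item> elements from a well-formed feed with XMLReader.
// streamItems() is a hypothetical helper name, not part of any library.
function streamItems(string $path): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    try {
        // read() walks the document node by node; nothing is loaded up front.
        while ($reader->read()) {
            $isItem = $reader->nodeType === XMLReader::ELEMENT
                && $reader->localName === 'item';
            if ($isItem) {
                // Serialize just this element (and its children) as a string.
                yield $reader->readOuterXml();
            }
        }
    } finally {
        $reader->close();
    }
}
```

A caller would then loop with something like foreach (streamItems('my.clean.xml') as $itemXml) { ... } and hand each fragment to SimpleXML or DOMDocument, which is cheap because a single item is small.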