PHP - Read and repair big invalid XML files

Question

I have to read some quite heavy XML files (between 200 MB and 1 GB), some of which are invalid. Here is a small example:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <item>
    <title>Some article</title>
    <g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>
  </item>
</rss>

Obviously, a closing </ul> tag is missing inside the g:material element. Moreover, whoever built this feed should have wrapped the g:material content in a CDATA section, but did not. Basically, that is what I want to do: add the missing CDATA section.

I tried reading the file with a SAX parser, but it fails at the </g:material> tag because the </ul> tag is missing. XMLReader gave me basically the same issue. I could probably do something with DomDocument::loadHtml, but the size of these files is not really compatible with a DOM approach. Do you have any idea how I could simply repair this feed without having to buy lots of RAM for DomDocument to work? Thanks.

Recommended answer

If the files are too large to use the Tidy extension, you can use the tidy CLI tool to make the files parseable.

$ tidy -output my.clean.xml my.xml

After that, the XML files are well-formed, so you can parse them with XMLReader. Since tidy adds the 'missing' (X)HTML parts, your original document's code ends up inside the <body> element.
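
If the only damage is raw HTML inside g:material, a streaming pass that adds the missing CDATA section may be enough, without tidy and without loading a DOM. What follows is a minimal sketch, not a tested solution for your feed: wrapMaterialInCdata() and repairFeed() are hypothetical helper names, and it assumes that each <g:material>...</g:material> pair sits on a single line, as in the sample above.

```php
<?php
// Sketch only: wrapMaterialInCdata() and repairFeed() are hypothetical helpers,
// not part of PHP and not part of the original answer.

/**
 * Wrap the raw content of every <g:material> element found on one line in CDATA.
 * Assumption: the opening and closing tags are on the same line.
 */
function wrapMaterialInCdata(string $line): string
{
    $result = preg_replace_callback(
        '~<g:material>(.*?)</g:material>~',
        function (array $m): string {
            // Leave the element alone if its content already starts with CDATA.
            if (strpos($m[1], '<![CDATA[') === 0) {
                return $m[0];
            }
            // "]]>" cannot appear inside CDATA, so split it across two sections.
            $safe = str_replace(']]>', ']]]]><![CDATA[>', $m[1]);
            return '<g:material><![CDATA[' . $safe . ']]></g:material>';
        },
        $line
    );

    // preg_replace_callback() returns null on a PCRE error; keep the line as-is then.
    return $result ?? $line;
}

/**
 * Copy $in to $out line by line, so memory use depends on the longest
 * line rather than on the size of the file.
 */
function repairFeed(string $in, string $out): void
{
    $src = fopen($in, 'rb');
    $dst = fopen($out, 'wb');
    if ($src === false || $dst === false) {
        throw new RuntimeException('Cannot open input or output file');
    }
    while (($line = fgets($src)) !== false) {
        fwrite($dst, wrapMaterialInCdata($line));
    }
    fclose($src);
    fclose($dst);
}
```

Calling repairFeed('my.xml', 'my.clean.xml') would write a copy in which the g:material content is character data, so the unclosed <ul> no longer matters to XMLReader or a SAX parser. If a feed spreads the element over several lines, the line-based assumption breaks and you would need to buffer until the closing tag.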
