How to remove invalid XML characters from a document with PHP

Problem description

I am trying to generate an XML document of around 23 to 30 MB. When I open it in Firefox, I receive:

XML Parsing Error: not well-formed
Location: file:///Users/User/Downloads/export(2).xml
Line Number 137725, Column 1343:

After that, I tried to validate the document with XML Nanny and received the following error:

Invalid Character (Unicode: 0xB)

On several (13) lines: 137725, 137738, 137751, 137764, 137777, 137790, 137803, 137816, 146834, 189949, 193444, 193457, 193470

I've tried several "solutions", including:

  1. Regular Expression:

    preg_replace('/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/', ' ', $data->Description);

    I'm not sure this is a valid regular expression: I receive an Internal Server Error, which I attribute to mod_security being enabled on our Apache server.

  2. I've tried to save my file as UTF-8 with a BOM, but that was a desperate attempt.

  3. I've tried to use iconv with 'UTF-8//IGNORE', but this didn't help.

  4. I've tried character-by-character replacement, but this didn't work well with my file because I have 230k lines. Even when I only process the specific tag that has the problem, I hit PHP's max_execution_time directive and my script is killed.

For now my solution is to clear these invalid characters from the database records manually, but this is not a proper solution to my problem: in the future this script will be used to automate the export, and manual editing isn't an option.

Solution

First of all, I'd stick to the information given by XML Nanny:

Invalid Character (Unicode: 0xB) (several lines)

0xB is a character from the control character range, and only a very limited set of control characters is allowed in an XML document. I suggest you start by replacing those with numeric character references and try again:

$xml = strtr($xml, array("\x0B" => "&#x0B;"));

Firefox might accept those.
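
If the parser still rejects the document, note that XML 1.0 does not permit U+000B even when it is written as a character reference, so removing the character outright may be the more dependable route. The regular expression from the question can do that once it carries the `u` modifier; without it, PCRE refuses `\x{...}` values above 0xFF and the pattern fails to compile. Whether that is related to the Internal Server Error mentioned in the question is not something I can tell from here. A minimal sketch, assuming PHP 7 or later and UTF-8 input; `stripInvalidXmlChars` is a hypothetical helper name, not a built-in function:

```php
<?php
/**
 * Remove every character that the XML 1.0 "Char" production does not allow.
 *
 * Hypothetical helper for illustration. Assumes $value is UTF-8 encoded.
 */
function stripInvalidXmlChars(string $value, string $replacement = ''): string
{
    // Allowed in XML 1.0: tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD,
    // U+10000-U+10FFFF. The trailing `u` modifier switches PCRE to UTF-8
    // mode, which is required for \x{...} escapes above 0xFF.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';

    $clean = preg_replace($pattern, $replacement, $value);

    // preg_replace() returns null on error, for example when the subject
    // is not valid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException(
            'Could not clean the string (PCRE error code ' . preg_last_error() . ')'
        );
    }

    return $clean;
}

// Usage: the vertical tab (0x0B) reported by XML Nanny is dropped.
echo stripInvalidXmlChars("foo\x0Bbar"), PHP_EOL; // prints "foobar"
```

Applied per field (for example to `$data->Description` before it is written out), this runs one compiled pattern per value rather than a PHP-level loop over every character, which should keep it well clear of the time limit for a file of this size. Pass `' '` as the second argument if a run of invalid characters should become a space instead of disappearing.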
