Illegal character in XML

Question

I have a PHP file which produces an XML sitemap from data imported from a number of sources. The sitemap is currently not well formed because of an illegal character in one line of the imported data, and I am struggling to remove it.

The character appears to represent the 'squared' symbol (superscript 2) and is displayed as a square. I have tried pasting it into a hex editor, but it is shown as a ?, and the hex code also corresponds to ?. I have also tried using iconv to convert from every source encoding to every destination encoding, and no combination removes the character.

I also have the following function to remove non-ASCII characters:

function stripInvalidXml($value)
{
    $ret = "";
    if (empty($value))
    {
        return $ret;
    }

    // strlen() and ord() work on bytes, so $current is always in 0..255.
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++)
    {
        // $value[$i] replaces $value{$i}, which was removed in PHP 8.
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            if ($current != 0x1F)
            {
                $ret .= chr($current);
            }
        }
        else
        {
            // Replace anything outside the allowed ranges with a space.
            $ret .= " ";
        }
    }

    return $ret;
}

However, this still does not remove it. If I step through the code, the illegal character is expanded to ￿ in Eclipse's debug window. The string it has trouble with is below (hoping it pastes correctly):

251gm-50

Any ideas for a function that removes this character and prevents this from happening again would be much appreciated. I have little control over the imported data, so it needs to be done at the point of XML generation.

Edit

After posting I can see that the character doesn't display correctly. In Eclipse's window it appears as & # 65535 ; (written here with spaces, because without them it renders as the character, which looks like ￿).

Recommended answer

I think I was looking down the wrong path: rather than an encoding issue, the character was an HTML entity representing the 'squared' symbol. As the descriptions in the URL only exist for search engine purposes, I can safely remove all HTML entities with the following regex:

$content = preg_replace("/&#?[a-z0-9]+;/i","",$content);
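
Note that the regex above removes every entity, including legitimate ones such as &amp;. Where that is too broad, a narrower variant is possible. The following is only a sketch, not the original poster's code: the names isValidXmlCodePoint and stripIllegalXmlReferences are hypothetical, and it assumes the problem is limited to numeric character references (such as &#65535;) that resolve to a code point outside the XML 1.0 character ranges. Named HTML entities that XML does not predefine would still need separate handling.

```php
<?php
// Hypothetical helpers for illustration; not part of the original post.

/**
 * True if $cp falls inside the character ranges XML 1.0 allows
 * (the same ranges used in the question's function).
 * A float is used so that oversized references cannot overflow an int.
 */
function isValidXmlCodePoint(float $cp): bool
{
    return $cp == 0x9 || $cp == 0xA || $cp == 0xD
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

/**
 * Remove numeric character references (decimal or hex) whose code point
 * is not allowed in XML 1.0. All other text, including valid references
 * and named entities, is left untouched.
 */
function stripIllegalXmlReferences(string $content): string
{
    $result = preg_replace_callback(
        '/&#(?:x([0-9a-f]+)|([0-9]+));/i',
        function (array $m): string {
            // Group 1 holds hex digits, group 2 holds decimal digits.
            $cp = $m[1] !== '' ? (float) hexdec($m[1]) : (float) $m[2];
            return isValidXmlCodePoint($cp) ? $m[0] : '';
        },
        $content
    );

    // preg_replace_callback returns null on failure; fall back to the input.
    return $result ?? $content;
}

// Hypothetical input: the reference to U+FFFF is dropped, &amp; is kept.
echo stripIllegalXmlReferences('size &amp; weight: 251gm&#65535;-50'), "\n";
```

Applied at the point of XML generation, this keeps valid references intact and drops only those that would make the sitemap ill-formed.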
