How to substitute non-SGML characters in a string using PHP?

Problem description

I programmed a guestbook using PHP4 and HTML 4.01 (with the charset ISO-8859-15, i.e. latin-9). The data is saved in a MySQL database with the charset ISO-8859-1 (i.e. latin-1).

When somebody enters characters from a different charset, it seems that the browsers send the data encoded (actually I have not checked where it gets encoded, ...).

Anyway, in some cases it seems that characters are not saved encoded in the database. Thus, the validator returns an error message when I show the data within an HTML 4.01 document:

non SGML character number 146

You have used an illegal character in your text. HTML uses the standard UNICODE Consortium character repertoire, and it leaves undefined (among others) 65 character codes (0 to 31 inclusive and 127 to 159 inclusive) that are sometimes used for typographical quote marks and similar in proprietary character sets. The validator has found one of these undefined characters in your document. The character may appear on your browser as a curly quote, or a trademark symbol, or some other fancy glyph; on a different computer, however, it will likely appear as a completely different character, or nothing at all.

Your best bet is to replace the character with the nearest equivalent ASCII character, or to use an appropriate character entity. For more information on Character Encoding on the web, see Alan Flavell's excellent HTML Character Set Issues reference.

This error can also be triggered by formatting characters embedded in documents by some word processors. If you use a word processor to edit your HTML documents, be sure to use the "Save as ASCII" or similar command to save the document without formatting information.

I'm now using PHP 5.2.17 and played a bit with htmlspecialchars, but nothing worked. How can I encode those characters so that there are no more validation errors?

Solution

In both ISO-8859-1 and ISO-8859-15, character number 146 (0x92) is the control character PU2 (Private Use Two) from the C1 range.

SGML refers to ISO 8859-1 (mind the space between ISO and 8859-1; it is not a hyphen as in the character set names you use). It allows only three control characters (source: "SGML in HTML"):

In the HTML document character set only three control characters are allowed: Horizontal Tab, Carriage Return, and Line Feed (code positions 9, 13, and 10).

You therefore did pass an illegal character. There is no SGML/HTML entity you could replace it with.

I suggest you validate the input that comes into your application so that it does not allow control characters. If you believe those characters originally represented something useful, like a letter that can actually be read (i.e. not a control character), it's likely that the encoding gets broken at some point while you process the data.
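As a rough illustration of that kind of input check, here is a minimal sketch. The function name is made up for this example, and it assumes the text is in a single-byte encoding such as ISO-8859-1, so every byte is one character. It flags everything in 0-31 and 127-159 except tab, line feed and carriage return:

```php
<?php
// Hypothetical helper (name chosen for illustration only).
// Assumes a single-byte encoding such as ISO-8859-1 / ISO-8859-15.
function containsDisallowedControlChars($text)
{
    // Bytes 0-31 except tab (9), line feed (10) and carriage return (13),
    // plus DEL (127) and the C1 range (128-159).
    // The pattern is single-quoted so PCRE itself interprets the \x escapes.
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/', $text) === 1;
}

var_dump(containsDisallowedControlChars("Hello\tworld\r\n")); // bool(false)
var_dump(containsDisallowedControlChars("It\x92s"));          // bool(true)
```

Note that under this strict view a Windows-1252 curly quote (byte 0x92) is rejected too, which may or may not be what you want for a guestbook.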

From the information given in your question it's hard to say where, because you only specify the input encoding and the encoding of the database field, and those two already don't match (which should not produce the issue you're asking about, but it can produce other issues). Besides those two places, there is also the database client connection charset, the output encoding and the response content encoding, all of which are unspecified in your question.

It might make sense that you change your overall encoding to UTF-8 to support a wider range of characters, but that's really a might.

Edit: The part above is a somewhat strict view. It came to my mind that the input you receive is actually not ISO-8859-1 (or -15) but something else, like a Windows code page. I'd say it's probably Windows-1252 (cp1252), see Wikipedia. In the positions where ISO-8859-1 has its C1 control range (128-159), Windows-1252 has mostly printable characters.
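To check whether stored entries really contain bytes from that range, a small diagnostic along these lines could help. It is only a sketch; the function name and the sample string are invented for illustration:

```php
<?php
// Hypothetical diagnostic helper: reports every byte in 0x80-0x9F
// together with its offset in the string.
function findC1RangeBytes($text)
{
    $found = array();
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($text[$i]);
        if ($byte >= 0x80 && $byte <= 0x9F) {
            // offset => "hex (decimal)", matching the validator's decimal numbers
            $found[$i] = sprintf('0x%02X (%d)', $byte, $byte);
        }
    }
    return $found;
}

print_r(findC1RangeBytes("It\x92s \x93quoted\x94"));
```

If the reported bytes are mostly 145-151 (curly quotes, bullets, dashes), that is a strong hint the data is Windows-1252 rather than real C1 controls.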

The Wikipedia page also notes that most browsers treat ISO-8859-1 as Windows-1252 (also written CP1252 or CP-1252). PHP's htmlentities() function is not able to deal with these characters because its translation table for HTML entities does not cover these code points (PHP 5.3, not tested against 5.4). You need to create your own translation table and use it with strtr to replace the Windows-1252 characters that are not available in ISO 8859-15:

/*
 * mappings of Windows-1252 (cp1252)  128 (0x80) - 159 (0x9F) characters:
 * @link http://en.wikipedia.org/wiki/Windows-1252
 * @link http://www.w3.org/TR/html4/sgml/entities.html
 */
$cp1252HTML401Entities = array(
    "\x80" => '&euro;',    # 128 -> euro sign, U+20AC NEW
    "\x82" => '&sbquo;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&fnof;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&bdquo;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&hellip;',  # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&dagger;',  # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&Dagger;',  # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&circ;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&permil;',  # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&Scaron;',  # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&lsaquo;',  # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&OElig;',   # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D (no named entity in HTML 4.01)
    "\x91" => '&lsquo;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&rsquo;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&ldquo;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&rdquo;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&bull;',    # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&ndash;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&mdash;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&tilde;',   # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&trade;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&scaron;',  # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&rsaquo;',  # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&oelig;',   # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E (no named entity in HTML 4.01)
    "\x9F" => '&Yuml;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);

$outputWithEntities = strtr($output, $cp1252HTML401Entities);
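One thing the snippet leaves open is how it combines with htmlspecialchars. A possible arrangement, sketched here with a made-up function name and a deliberately shortened table, is to escape the markup characters first and apply the cp1252 mapping second, so the ampersands of the inserted entities are not escaped again. The explicit 'ISO-8859-1' argument is an assumption that the data is single-byte; newer PHP versions default to UTF-8 there, which does not suit this data:

```php
<?php
// Hypothetical helper; the name and the four-entry table are illustrative only.
function escapeGuestbookText($raw)
{
    // Small subset of a cp1252 => numeric entity table.
    $cp1252ToEntities = array(
        "\x80" => '&#8364;', // euro sign
        "\x92" => '&#8217;', // right single quotation mark
        "\x93" => '&#8220;', // left double quotation mark
        "\x94" => '&#8221;', // right double quotation mark
    );

    // Step 1: escape <, >, & and quotes, treating the input as single-byte text.
    $escaped = htmlspecialchars($raw, ENT_QUOTES, 'ISO-8859-1');

    // Step 2: replace the remaining cp1252 bytes with entities.
    return strtr($escaped, $cp1252ToEntities);
}

echo escapeGuestbookText("It\x92s <b>5 \x80</b>"), "\n";
// prints: It&#8217;s &lt;b&gt;5 &#8364;&lt;/b&gt;
```

Doing it the other way round would turn each `&#8217;` into `&amp;#8217;`.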

If you want to be even safer, you can skip the named entities and use only numeric ones, which should work in very old browsers as well:

$cp1252HTMLNumericEntities = array(
    "\x80" => '&#8364;',   # 128 -> euro sign, U+20AC NEW
    "\x82" => '&#8218;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&#402;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&#8222;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&#8230;',   # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&#8224;',   # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&#8225;',   # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&#710;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&#8240;',   # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&#352;',    # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&#8249;',   # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&#338;',    # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D
    "\x91" => '&#8216;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&#8217;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&#8220;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&#8221;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&#8226;',   # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&#8211;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&#8212;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&#732;',    # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&#8482;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&#353;',    # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&#8250;',   # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&#339;',    # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E
    "\x9F" => '&#376;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);

Hope this is more helpful now. See also the Wikipedia page linked above for some characters that exist in both Windows-1252 and ISO 8859-15 but at different code points. You should probably consider using UTF-8 on your website.
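If you do move to UTF-8, existing entries would need converting. The following is only a sketch of that step. It assumes the iconv extension is available and that the stored bytes really are Windows-1252; the function name is invented for this example:

```php
<?php
// Hypothetical helper; assumes ext/iconv and Windows-1252 input.
function cp1252ToUtf8($text)
{
    // iconv() returns false (and raises a notice) when it meets a byte
    // it cannot map, so map that case to null for the caller to handle.
    $converted = iconv('Windows-1252', 'UTF-8', $text);
    return $converted === false ? null : $converted;
}

// Byte 0x92 becomes U+2019, which is e2 80 99 in UTF-8.
echo bin2hex(cp1252ToUtf8("\x92")), "\n"; // e28099
```

With UTF-8 output these characters can be sent as they are, so the replacement tables above are no longer needed.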
