How to avoid browser Unicode normalization when submitting a form with Unicode

Question

When rendering the following Unicode text in HTML, it turns out that the browser (Google Chrome) performs some form of Unicode normalization when posting the data back to the server (probably to Normalization Form C).

But when using Biblical Hebrew text (בְּרִיךְ הוּא), this can easily break the text, as outlined here (page 9).

Is there any way to avoid the browser's automatic text normalization?

I wrote a blog post that describes in more detail the issue that I'm facing: http://blog.hibernatingrhinos.com/12449/would-it-be-possible-to-have-a-web-browser-based-editor-for-an-hebrew-text

Recommended answer

This seems to be a feature/bug in WebKit browsers (Chrome, Safari); they normalize form data to NFC, which means, among other things, reordering consecutive combining marks to a "canonical" order. This was new to me, and bad news in cases like this. The worst thing is that different browsers behave differently.

Using a simplified version of your test case http://blog.hibernatingrhinos.com/12449/would-it-be-possible-to-have-a-web-browser-based-editor-for-an-hebrew-text (using a server-side script that just echoes the raw data), I noticed that Chrome and Safari reorder the diacritic marks in U+05E9 U+05C1 U+05B5 (SHIN, SHIN DOT, TSERE), whereas IE, Firefox, and Opera do not.

I also ran a simple test with Latin letter e followed by combining diaeresis U+0308. WebKit browsers convert it to the single character ë, as per NFC rules, whereas other browsers keep the character pair intact.
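The two effects described above can be reproduced outside the browser. The following is a minimal sketch using Python's standard unicodedata module; it only illustrates what NFC does to these two inputs and is not a reproduction of WebKit's own code path.

```python
import unicodedata

# SHIN, SHIN DOT, TSERE, in the order used in the test case.
hebrew = "\u05e9\u05c1\u05b5"
nfc_hebrew = unicodedata.normalize("NFC", hebrew)

# The canonical combining class decides the order of consecutive marks:
# the mark with the lower class is moved first.
for ch in nfc_hebrew:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

# Latin e followed by COMBINING DIAERESIS composes into one code point.
latin = "e\u0308"
nfc_latin = unicodedata.normalize("NFC", latin)
print(len(latin), len(nfc_latin))
```

TSERE has combining class 15 and SHIN DOT has class 24, which is why NFC moves TSERE in front of SHIN DOT.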

This seems to have been an intentional feature ever since 2006; https://bugs.webkit.org/show_bug.cgi?id=8769 proudly announces this as part of a bug fix! This might explain the status of the W3C policy document; its current version is WebKit-minded on this issue, but other browser vendors either aren’t interested or knowingly oppose the idea of "early normalization."

I don’t think there is a way to prevent this. But you could warn users against using Chrome and Safari. You could even use a hidden field containing a simple problem case, then check server-side whether it was transmitted as-is, and tell the user to change browsers if it wasn’t.
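The hidden-field check could look roughly like this on the server. This is only a sketch: the canary value, the function name canary_was_altered, and the way the submitted value reaches the function are all assumptions for illustration, and the browser behaviour is simulated with unicodedata.normalize.

```python
import unicodedata

# Hypothetical canary value placed in a hidden form field:
# SHIN, SHIN DOT, TSERE, a sequence that NFC reorders.
CANARY = "\u05e9\u05c1\u05b5"

def canary_was_altered(submitted: str) -> bool:
    """Return True if the hidden field did not come back code point for code point."""
    return submitted != CANARY

# Simulate a browser that leaves the data alone and one that normalizes to NFC.
untouched = CANARY
normalized = unicodedata.normalize("NFC", CANARY)
print(canary_was_altered(untouched), canary_was_altered(normalized))
```

The comparison must be made on the decoded string before any server-side normalization, otherwise the difference would be hidden.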

Fixing the order server-side isn’t simple, because common normalization routines apparently do not support the order needed. You could normalize to fully decomposed form (NFD), then reorder combining marks using your own code for the purpose. Perhaps simpler and safer, you could just run an ad hoc replacement routine that replaces sequences of combining marks with other sequences. This would be safer because it would not affect characters other than those you want to affect, whereas NFD decomposes Latin letters with diacritics, among other things.
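An ad hoc replacement routine of the kind suggested above might be sketched as follows. The table contains a single entry, chosen only because it undoes the reordering seen in the SHIN test case; the name restore_mark_order and the table itself are hypothetical, and which sequences need restoring depends on the text being edited.

```python
# Hypothetical replacement table: maps the order produced by NFC
# back to the order the text was authored in.
# Here: TSERE + SHIN DOT  ->  SHIN DOT + TSERE.
REPLACEMENTS = {
    "\u05b5\u05c1": "\u05c1\u05b5",
}

def restore_mark_order(text: str, replacements: dict = REPLACEMENTS) -> str:
    """Replace each listed mark sequence; everything else is left untouched."""
    for normalized_order, wanted_order in replacements.items():
        text = text.replace(normalized_order, wanted_order)
    return text

received = "\u05e9\u05b5\u05c1"   # what a normalizing browser sent
restored = restore_mark_order(received)
unaffected = restore_mark_order("\u00eb")  # precomposed Latin letter stays as-is
```

Because only the listed sequences are touched, precomposed Latin letters and other characters pass through unchanged, which is the advantage over normalizing everything to NFD.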

According to Unicode principles, canonically equivalent strings (e.g., differing only in the order of consecutive diacritic marks) are different representations of the same data but distinct as sequences of Unicode characters (code points); they are not expected to differ in presentation, but they may, and often do. Generally, you should not expect programs to treat canonically equivalent strings as different, though programs may make a difference. See Unicode Normalization FAQ.
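As a small illustration of that distinction, the sketch below compares two orderings of the same marks. The helper name canonically_equivalent is an assumption for this example; it checks equivalence by normalizing both strings to NFD.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

authored = "\u05e9\u05c1\u05b5"   # SHIN, SHIN DOT, TSERE
reordered = "\u05e9\u05b5\u05c1"  # SHIN, TSERE, SHIN DOT

same_code_points = authored == reordered
same_data = canonically_equivalent(authored, reordered)
print(same_code_points, same_data)
```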

The FAQ entry claims that the problems of biblical Hebrew have been solved by the introduction of COMBINING GRAPHEME JOINER. Although it prevents the reordering in Chrome, it’s a clumsy method, and it may mess up rendering (it does in web browsers; diacritic marks may get badly misplaced).
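The blocking effect of COMBINING GRAPHEME JOINER (U+034F) on reordering can be checked with a normalizer. This sketch uses Python's unicodedata rather than a browser, so it shows only the normalization side, not the rendering problems mentioned above.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, canonical combining class 0

# SHIN, SHIN DOT, CGJ, TSERE: the CGJ sits between the two marks.
protected = "\u05e9\u05c1" + CGJ + "\u05b5"
nfc_protected = unicodedata.normalize("NFC", protected)

# Marks are only reordered within a run of non-zero combining classes,
# so a class-0 character between them keeps each mark in place.
print(unicodedata.combining(CGJ), nfc_protected == protected)
```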
