Input Character Set Handling


Question





Hi

I am struggling to find definitive information on how IE 5.5, 6 and 7
handle character input (I am happy with the display of text).
I have two main questions:
1. Does IE automatically convert text input in HTML forms from the
native character set (e.g. SJIS, 8859-1, etc.) to UTF-8 prior to sending
the input back to the server?

2. Does IE JavaScript do the same? So if I write a JavaScript function
that compares a UTF-8 string to a string that a user has entered into
a text box, will IE convert the user's string into UTF-8 before doing
the comparison?
I think that the answer to question 1 is probably "YES", but I cannot
find any information on question 2!
Many thanks for your help
Kulgan.

Solution

Kulgan wrote:

1. Does IE automatically convert text input in HTML forms from the
native character set (e.g. SJIS, 8859-1, etc.) to UTF-8 prior to sending
the input back to the server?

With <form method="get">, the browser tries to pass the characters
to the server in the character set of the page, but it will only
succeed if the characters in question can be represented in that
character set. If not, browsers calculate "their best bet" based on
what's available (old style) or use a Unicode set (new style).

Example: western browsers send 'é' as '%E9' by default (URL encoding).
But when the page is in UTF-8, the browser will first look up the
Unicode multibyte encoding of 'é'. In this case it is 2 bytes, because
'é' lies in the two-byte UTF-8 range (code points U+0080 to U+07FF).
Those two bytes are 0xC3 and 0xA9, and they result in '%C3%A9'
(URL encoding) in the eventual query string.
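The two encodings in this example can be reproduced with standard JavaScript functions. This is only a sketch: the legacy `escape` function (deprecated, but it percent-encodes Latin-1 code units the way older browsers did by default) is contrasted with `encodeURIComponent`, which always encodes via UTF-8:

```javascript
// Legacy-style encoding: percent-encodes the single Latin-1 code unit
// of 'é'. `escape` is deprecated, but it mirrors the old '%E9' result.
const legacyForm = escape('é');            // '%E9'

// UTF-8 encoding: 'é' becomes the two bytes 0xC3 0xA9.
const utf8Form = encodeURIComponent('é');  // '%C3%A9'

console.log(legacyForm, utf8Form);
```

Running this in any modern browser console or Node.js prints `%E9 %C3%A9`, matching the two query-string forms described above.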

<form method="post" enctype="application/x-www-form-urlencoded"> is
the same as <form method="post"> and uses the same general principle
as GET.

With <form method="post" enctype="multipart/form-data"> there is no
default encoding at all, because this encoding type needs to be able to
transfer binaries that are not base64-encoded. 'é' will be passed as 'é'
and that's it.

2. Does IE JavaScript do the same? So if I write a JavaScript function
that compares a UTF-8 string to a string that a user has entered into
a text box, will IE convert the user's string into UTF-8 before doing
the comparison?

Browsers only encode form values between the moment the user
submits the form and the moment the new POST/GET request is made.
You should have no problem using any Unicode characters in
JavaScript as long as you haven't sent the form.
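The point above can be sketched as follows: before the form is submitted, a comparison inside the page involves no UTF-8 at all, only Unicode strings (the function name below is hypothetical, for illustration):

```javascript
// Hypothetical validator: compares the user's text-box value against
// an expected string. Both sides are plain Unicode strings in
// JavaScript, so no charset conversion is involved before submission.
function matchesExpected(userInput) {
  // '\u00e9' is the Unicode escape for 'é', so this literal is 'été'
  return userInput === '\u00e9t\u00e9';
}

console.log(matchesExpected('été')); // true
```

In a real page, `userInput` would come from something like `document.getElementById('myBox').value`; the comparison semantics are the same.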

Hope this helps,

--
Bart


Browsers only encode form values between the moment the user
submits the form and the moment the new POST/GET request is made.
You should have no problem using any Unicode characters in
JavaScript as long as you haven't sent the form.

Thanks for the helpful info.

On the JavaScript subject, if the user's input character set is not
UTF-8 (e.g. it is the Japanese SJIS set), but the page character set is
UTF-8, how does JavaScript see the characters? Does the browser do an
SJIS-to-UTF-8 conversion on the characters before they are used (e.g.
to find the length of a string)?

Thanks,

Kulgan.


Kulgan wrote:

2. Does IE JavaScript do the same? So if I write a JavaScript function
that compares a UTF-8 string to a string that a user has entered into
a text box, will IE convert the user's string into UTF-8 before doing
the comparison?

That is confusion inspired by Unicode, Inc. and the W3C (I often
wonder whether they have any clue at all about Unicode).

Unicode is a *charset*: a set of characters where each character unit
is represented by two bytes (taking the original Unicode 16-bit
encoding). At the same time, the TCP/IP protocol is an 8-bit medium: its
atomic unit is one byte. So one cannot directly send Unicode
entities over the Internet, just as one cannot place a 3D box on a
sheet of paper, only emulate it (by making its 2D projection). It is
therefore necessary to use some 8-bit *encoding* algorithm to split
Unicode characters into sequences of bytes, send them over the Internet,
and glue them back together on the other end. This is where the UTF-8
*encoding* (not *charset*) comes into play. By a special algorithm it
encodes Unicode characters into byte sequences and sends them to the
recipient. The recipient - informed in advance by the Content-Type
header of what is coming - uses a UTF-8 decoder to get back the original
Unicode characters.
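The encode/transmit/decode cycle described above can be made concrete with the standard TextEncoder/TextDecoder APIs (not available in the IE versions discussed in this thread, but the model is the same):

```javascript
// Encode the Unicode character 'é' into its UTF-8 byte sequence...
const bytes = new TextEncoder().encode('é');  // Uint8Array [0xC3, 0xA9]

// ...and decode it back to a Unicode string on "the other end".
const original = new TextDecoder('utf-8').decode(bytes); // 'é'

console.log(Array.from(bytes), original);
```

The two bytes 0xC3 0xA9 are exactly the '%C3%A9' that appears in a UTF-8 URL-encoded query string.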
Fact number one, unknown to the majority of specialists, including
the absolute majority of W3C volunteers - so consider yourself a chosen
one :-) -
The pragma <?xml version="1.0" encoding="utf-8"?>, which one sees left
and right in XML and pseudo-XHTML documents, *does not* mean that the
document is in UTF-8 encoding. It means that the document is in the
Unicode charset and must be transmitted (if needed) over an 8-bit medium
using the UTF-8 encoding algorithm. Accordingly, if the document does
not use the Unicode charset, then you are making a false statement, with
numerous nasty outcomes pending if it is ever used on the Internet.
Here is even more secret knowledge, shared between myself and Sir
Berners-Lee only :-) -
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
*does not* mean that the characters you see on your screen are in the
"UTF-8 charset" (there is no such thing). It means: "The input stream
was declared to contain Unicode-charset characters encoded using the
UTF-8 transport encoding. The result you are seeing (if you see
anything) is the result of decoding the input stream with a UTF-8
decoder."
"charset" term here is totally misleading one - it remained from the
old times with charsets of 256 entities maximum thus encoding matching
charset and vice versa. The proper header W3C should insist on is
....content="text/html; charset=Unicode; encoding=UTF-8"
As I said before, very few people on Earth know the truth, and the
Web has not collapsed so far for two main reasons:
1) The Content-Type header sent by the server takes precedence over the
META tag on the page. This HTTP standard is one of the most valuable
ones left to us by our fathers. They foresaw the ruling ignorance, and
so left server admins a chance to save the world :-)
2) All modern UAs have special heuristics built in to sort out real
UTF-8 input streams from authors' mistakes. A note for the "Content-Type
in my heart" adepts: it means that over the last years a great number
of viewer-dependent XML/XHTML documents has been produced.

Sorry for such an extremely long preface, but I considered it dangerous
to just keep giving "short fix" advice: that fights the symptoms
instead of the sickness. And the sickness is growing worldwide: our
helpdesk is flooded with requests like "my document is in UTF-8
encoding, why..." etc.

Coming back to your original question: the page will be either Unicode
or ISO-8859-1 or something else, but it will *never* be UTF-8: UTF-8
exists only during the transmission and parsing stages. The maximum one
can do is to have UTF-8 encoded characters right in the document, like
%D0%82... But in that case it is just raw UTF-8 source represented
using the ASCII charset.

On the other hand, JavaScript operates with Unicode only, and it sees
the page content "through the window of Unicode" no matter what the
actual charset is. So to reliably compare user input / node values with
JavaScript strings you have to:
1) The most reliable option for an average-to-small amount of non-ASCII
characters: use \u Unicode escape sequences.

2) Less reliable, as it can easily be smashed once opened in a
non-Unicode editor: keep the entire .js file in Unicode, with non-ASCII
characters typed as they are, and have your server send the file in
UTF-8 encoding.
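Option 1 can be sketched like this: the string stays byte-for-byte ASCII in the source file, yet compares equal to the non-ASCII characters the user types (the example word is mine, for illustration):

```javascript
// '\u65e5\u672c\u8a9e' spells the Japanese word "日本語" using only
// ASCII bytes in the source, so no editor or server charset
// mismatch can corrupt the literal.
var expected = '\u65e5\u672c\u8a9e';

console.log(expected === '日本語'); // true
console.log(expected.length);      // 3
```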

P.S. There is a whole other issue, which could be named "How do I handle
Unicode 32-bit characters, or How did Unicode, Inc. screw the whole
world". But your primary question is answered, and it's beer time
anyway. :-)

