Unicode and html - help for simple web site


Problem description



I have a home-made website that provides a free
1100-page physics textbook. It is written in HTML and
CSS. I recently added some Chinese text, and
since that day there have been problems.

The entry page has two Chinese characters,
but these are not seen on all browsers, even
though the page is validated by
the W3C validator.
(http://www.motionmountain.net/welcome.html)
(1) Why not?

Other pages do not validate in w3c
( http://www.motionmountain.net/contents.html)
(2) What is wrong here?

Since I plan to add more languages, and the unicode
issues are so tough:

(3) When uploading unicode files via ftp,
which line endings have to be used (mac, unix, other)?
Ascii mode or binary mode? (I have Mac OSX)

(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?

Thank you for any help!

Christoph

Recommended answer

ch***********@yahoo.com wrote:
> The entry page has two chinese characters,
> but these are not seen on all browsers, even
> though the page is validated by
> the w3c validator.
> ( http://www.motionmountain.net/welcome.html)
> (1) Why not?

It validates just because the validator is so permissive and does not
care about the conflict between the encoding you declare in the meta tag
(ISO-8859-1) and the encoding you actually use. In fact, I would say
that the validator is in error here: the encoding is specified as
ISO-8859-1 (the meta tag takes effect when no charset is specified in
HTTP headers), so the first two octets of the data _must_ be interpreted
as þÿ (Latin small letter thorn and Latin small letter y with
diaeresis), which of course violate HTML syntax when appearing before a
DOCTYPE declaration.

The validator incorrectly guesses that þÿ is meant to act as a byte
order mark in UTF-16 encoding and therefore treats the document as
UTF-16 encoded. (The guess is "correct" of course in a pragmatic sense,
but it's still an error.)
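
To see where the two stray characters come from, one can decode the octets of a UTF-16 byte order mark as if they were ISO-8859-1 text. This is only a minimal Python sketch; the function name `bom_as_latin1` is made up for illustration and is not taken from the validator or from any browser.

```python
import codecs

def bom_as_latin1(bom: bytes) -> str:
    """Decode raw BOM octets as if they were ISO-8859-1 text."""
    return bom.decode("iso-8859-1")

# Big-endian UTF-16 starts with the octets FE FF.
print(bom_as_latin1(codecs.BOM_UTF16_BE))  # þÿ
# Little-endian UTF-16 starts with FF FE, which reads as the same two letters swapped.
print(bom_as_latin1(codecs.BOM_UTF16_LE))  # ÿþ
```

A page that shows þÿ (or ÿþ) at the very top is therefore a fairly strong sign that UTF-16 data is being read as ISO-8859-1.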

Browsers may behave in the same incorrect way, or they may correctly
interpret the document as ISO-8859-1 encoded, in which case it is
syntactically wrong and browsers may do what they like. Here's what Lynx
shows (there's a subtle hint to some problems other than encoding
problems in my quoting this):

þÿ

. jpg jpg
jpg
jpg jpg

MOTION MOUNTAIN

THE PHYSICS TEXTBOOK

logo

Welcome Contents Download Search Project Guest Book Links Author
Prizes July 5, 2005
jpg jpg
jpg jpg

Apparently, if you wish to use UTF-16, remove the tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
or (perhaps better) replace ISO-8859-1 by UTF-16 (in which case you need
to remember to change it again if you change the document's encoding).
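
A conflict between the declared charset and the actual encoding can be spotted before uploading. The Python sketch below is an assumption-laden illustration, not a validator: the helper names `sniff_bom` and `declaration_conflicts` are hypothetical, the declared charset is passed in by hand rather than parsed from the meta tag, UTF-32 byte order marks are ignored, and a declaration such as UTF-16BE is treated as different from plain UTF-16.

```python
import codecs
from typing import Optional

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"
    return None

def declaration_conflicts(data: bytes, declared: str) -> bool:
    """True when the file has a BOM that disagrees with the declared charset."""
    sniffed = sniff_bom(data)
    if sniffed is None:
        return False  # no BOM, so this sketch has nothing to compare against
    # codecs.lookup normalises spellings such as "UTF-16" and "utf_16".
    return codecs.lookup(declared).name != codecs.lookup(sniffed).name

# A UTF-16 page (with BOM) whose meta tag claims ISO-8859-1:
page = codecs.BOM_UTF16_BE + "<!DOCTYPE html>".encode("utf-16-be")
print(declaration_conflicts(page, "ISO-8859-1"))  # True
print(declaration_conflicts(page, "UTF-16"))      # False
```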

> Other pages do not validate in w3c
> ( http://www.motionmountain.net/contents.html)
> (2) What is wrong here?

1695 errors, vow! :-)

I suspect they relate to character encoding problems too; "non SGML
character number 0" sounds like the validator had encountered the NUL
character (U+0000) and got confused, but if I remember correctly, this
cryptic message arises in different situations.

How do you produce and edit your HTML files? It seems that they might
not all be properly UTF-16 encoded.

> (3) When uploading unicode files via ftp,
> which line endings have to be used (mac, unix, other)?

In HTML, all commonly used line endings are traditionally (and by the
specs) accepted.

> Ascii mode or binary mode? (I have Mac OSX)

If you use UTF-16 or UTF-8, binary - you do _not_ want any
Mac-to-something else conversions, since you are using a standard
Unicode encoding already. What matters is whether your editing software
produces correct UTF-something.
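
Why a line-ending conversion is harmful can be shown with a small simulation. The Python sketch below does not talk to any FTP server; `simulate_ascii_transfer` is a hypothetical stand-in that assumes the transfer rewrites every LF octet as CR LF, which is only one of the conversions a real client or server might apply.

```python
def simulate_ascii_transfer(data: bytes) -> bytes:
    """Crude stand-in for an ASCII-mode transfer that rewrites LF as CR LF."""
    return data.replace(b"\n", b"\r\n")

def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Report whether the octets are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

original = "a\nb".encode("utf-16-be")         # 00 61 00 0A 00 62
mangled = simulate_ascii_transfer(original)   # 00 61 00 0D 0A 00 62

print(decodes_cleanly(original, "utf-16-be"))  # True
print(decodes_cleanly(mangled, "utf-16-be"))   # False
```

The inserted octet shifts everything after it by one position, so the two-octet units of UTF-16 no longer line up. Binary mode leaves the octets alone.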

> (4) To get IE to read the pages, is it best to use UTF-8
> or UTF-16?







Even IE can handle both, but UTF-8 is surely more efficient, if the
majority of the text is English. In UTF-16, every (BMP) character is two
octets.
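
The size difference for English text is easy to check. A minimal Python sketch, where `encoded_sizes` is a made-up helper name and the sample string is the heading from the Lynx output quoted earlier in the thread:

```python
def encoded_sizes(text: str) -> dict:
    """Octet counts of the same text in UTF-8 and in UTF-16 (without a BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

print(encoded_sizes("THE PHYSICS TEXTBOOK"))  # {'utf-8': 20, 'utf-16': 40}
```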


Jukka K. Korpela wrote:

>> ( http://www.motionmountain.net/contents.html)
>> (2) What is wrong here?
>
> 1695 errors, vow! :-)
>
> I suspect they relate to character encoding problems too; "non SGML
> character number 0" sounds like the validator had encountered the NUL
> character (U+0000) and got confused, but if I remember correctly, this
> cryptic message arises in different situations.







In this case, the error message is correct: the document contains data like
<td>&nbsp;</td>
so that each of those characters is followed by NUL, U+0000. The
validator reports NUL as an error, since it is a "non SGML character",
which is a technicality I won't dig into now. The problem is apparently
that the data comes, presumably via server-side include (as the comment
before it suggests) from an ASCII file that is converted to Unicode
format too eagerly. If you have ASCII data to be embedded into a UTF-16
encoded document, each octet shall be followed by a zero octet. What has
happened here is that each octet is followed by _three_ zero octets (as
if the encoding were UTF-32), which means in UTF-16 interpretation that
you have NULs all around. Although browsers may skip NULs, NULs are an
error in HTML.

So perhaps there is some simple ASCII to UTF-16 transformation that is
applied _twice_ by mistake, or maybe there is an ASCII to UTF-32
transformation.
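
Both suspected causes can be reproduced in a few lines. The Python sketch below uses little-endian encodings and a hand-written `widen_octets` helper purely for illustration; nothing here is known about the tool that actually produced the site's files.

```python
def widen_octets(data: bytes) -> bytes:
    """Naive ASCII-to-UTF-16 step: follow every octet with a zero octet."""
    return b"".join(bytes([octet, 0]) for octet in data)

ascii_fragment = b"<td>"

once = widen_octets(ascii_fragment)   # correct UTF-16 for ASCII input
twice = widen_octets(once)            # the same step applied a second time

# Applied once, the result is ordinary UTF-16.
print(once == "<td>".encode("utf-16-le"))    # True
# Applied twice, the octets are identical to UTF-32.
print(twice == "<td>".encode("utf-32-le"))   # True
# Read as UTF-16, every character is now followed by NUL.
print(repr(twice.decode("utf-16-le")))       # '<\x00t\x00d\x00>\x00'
```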


On Thu, 25 Aug 2005, Jukka K. Korpela wrote:

[...]

>> (4) To get IE to read the pages, is it best to use UTF-8
>> or UTF-16?
>
> Even IE can handle both, but UTF-8 is surely more efficient, if the
> majority of the text is English.







Agreed, and utf-8 is, in general, better supported than utf-16
(talking not only about browsers, but also about search engines etc.).

However, *if* the bulk of the content were in Chinese, presumably
utf-16 would be more compact than utf-8. (I have the impression that
currently, most Chinese documents are in one of the specifically
Chinese encodings, rather than Unicode, but that's by the by. Oh, and
when using Unicode, remember to specify the language, to help browsers
to choose a preferred rendering for unified Han characters[1]).
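
The compactness claim for Chinese text can be checked, with one caveat: HTML markup is ASCII, so it counts against UTF-16 even on a Chinese page. A rough Python sketch, in which `utf8_vs_utf16` is a hypothetical helper, the sample words are chosen only for illustration, and `lang="zh"` is just an example of a language declaration:

```python
def utf8_vs_utf16(text: str) -> tuple:
    """Return (UTF-8 octets, UTF-16 octets) for the text, BOM not counted."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Five Chinese characters on their own: UTF-16 is smaller.
print(utf8_vs_utf16("物理教科书"))                    # (15, 10)

# The same five characters inside a short element: the ASCII markup
# tips the balance back towards UTF-8.
print(utf8_vs_utf16('<p lang="zh">物理教科书</p>'))   # (32, 44)
```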

hope this helps

[1] This is not my field at all, but a web search throws up a
wikipedia article which, as far as I can tell, seems to be a
reasonable discussion at a level that I can understand.
http://en.wikipedia.org/wiki/Han_unification

I can't speak personally for any of its technical detail - like any
wikipedia article, who knows what a specialist in the field would have
to say about it? For all I know, it may be spotless, I just can't
tell; but at least it gives the flavour of the issues involved.

