html tidy, word 2003 and "smart quotes"

Problem description

Hello, I'm having an aggravating time getting the "html" spewed by Word
2003 to display correctly in a webpage.

The situation here is that the people creating the documents only know
Word, and aren't very computer savvy. I created a system where they
can save their Word documents as "html" and upload them to a certain
directory, and the web page dynamically runs them through tidylib using
the tidy extension to php4, thus causing the document to display
correctly. I also run the files through a couple of sed expressions to
remove xml tags that have no business being there.

It alllllmost works. The resulting document follows the page's css
rules and displays correctly, except for those durned "smart quotes".

As you know, Word defaults to replacing straight quotes with fancy
quotes using an encoding that doesn't work on web pages. When you
"save as html", the resulting code doesn't display correctly. You can
turn off "smart quotes" (which I have suggested) but that only counts
towards *new* documents -- existing documents still have the problem.

Now when I use TidyUI on Windows XP, I can SEE the fancy quotes turn
into straight quotes. But when I use tidy on the command line or
tidylib through the php extension, the substitution does *not* take
place. (Freshly downloaded version of tidy in every case.)

On the Linux box I have "bare", "clean" and "word-2000" turned on.
(The code looks different if I turn any of them off, so I'm sure
they're getting turned on.) What it seems to come down to is that
tidy, with the same options, cleans up *different* things on Linux than
it does on Windows.

What are my options at this point? The users will continue to use Word
2003 -- no help there. My web server is Apache on Linux -- that's not
going to change. How do I get from here to there, dynamically, with no
user intervention?

Thanks very much for any and all suggestions. If I can solve this,
I've made it that much less likely that we'll switch to IIS.

Ron (ro**@europa.com)

Solution

Ron wrote:

> Hello, I'm having an aggravating time getting the "html" spewed by Word
> 2003 to display correctly in a webpage.

Not a good idea to use Word for HTML at all, but at least you're trying to
clean it up.

> I created a system where they can save their Word documents as "html"
> and upload them to a certain directory, and the web page dynamically
> runs them through tidylib...
>
> It alllllmost works. The resulting document follows the page's css
> rules and displays correctly, except for those durned "smart quotes".

There's nothing inherently wrong with the curly quotes; the problem with
them is only that people fail to understand the character encoding
issues properly. Word documents are saved in the Windows-1252 encoding
by default. The quotes you are referring to are in the positions 145
(‘), 146 (’), 147 (“) and 148 (”). However, these code points (and all
others in the range from 128 to 159) are control codes in ISO-8859-1 and
others. Thus, the main problem is only caused by declaring the
incorrect character encoding.
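
To make the point about positions 145 to 148 concrete, here is a minimal Python sketch. Python is used only for illustration and the variable names are made up for this example; nothing in the thread depends on them. It decodes the same four bytes once as Windows-1252 and once as ISO-8859-1.

```python
import unicodedata

# Bytes 145-148 (0x91-0x94), as they appear in a Windows-1252 file.
raw = bytes([145, 146, 147, 148])

# Decoded with the encoding the file was really written in,
# the bytes are the four curly quotes.
as_cp1252 = raw.decode("cp1252")

# Decoded as ISO-8859-1, the very same bytes are C1 control characters.
as_latin1 = raw.decode("iso-8859-1")

for byte, good, bad in zip(raw, as_cp1252, as_latin1):
    print(
        byte,
        f"cp1252: U+{ord(good):04X} {unicodedata.name(good)}",
        f"| iso-8859-1: U+{ord(bad):04X} category {unicodedata.category(bad)}",
    )
```

The bytes never change; only the declared encoding decides whether a reader sees quotation marks or control codes.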

Although declaring the encoding as Windows-1252 in the HTTP headers will
work, it is not recommended because Windows-1252 is a proprietary
encoding designed for Windows only (support may have been added
to other systems too, but that's not guaranteed).

The best options are to either save the files as UTF-8 and declare that
encoding in the HTTP headers, or continue to use ISO-8859-1 and replace
the quotes (and other special Windows-1252 chars) with numeric character
references. I think Word does have an option to save files as UTF-8,
which I recommend.
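
The two options can be sketched in Python as follows. This is an illustration under assumptions, not the setup used in the thread: the function names `to_utf8` and `to_latin1_with_ncrs` and the sample markup are hypothetical, and the input is assumed to really be Windows-1252.

```python
def to_utf8(raw: bytes) -> bytes:
    """Option 1: reinterpret Windows-1252 bytes and re-encode them as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


def to_latin1_with_ncrs(raw: bytes) -> bytes:
    """Option 2: keep ISO-8859-1, turning anything it cannot represent
    into a decimal numeric character reference such as &#8220;."""
    text = raw.decode("cp1252")
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# Hypothetical fragment of Word output, as raw Windows-1252 bytes.
sample = b"<p>\x93smart quotes\x94 and Ron\x92s caf\xe9</p>"

print(to_utf8(sample))
print(to_latin1_with_ncrs(sample))
```

With the first option the page has to be served with `Content-Type: text/html; charset=utf-8`; with the second it can stay ISO-8859-1. Python's `cp1252` codec raises `UnicodeDecodeError` on the five byte values that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so real input may need an explicit `errors=` policy.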

More information about Windows-1252 and the numeric character references
is available at:
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox


Lachlan Hunt wrote:

> Ron wrote:
>
>> Hello, I'm having an aggravating time getting the "html" spewed by Word
>> 2003 to display correctly in a webpage.
>
> Not a good idea to use Word for HTML at all, but at least you're trying to
> clean it up.

That was not his idea...
The same problem occurs when you are using a WYSIWYG editor component (like
HTMLArea) on a webpage and people copy and paste stuff from Word. I hate these
things (besides the fact that the web and the WYSIWYG concept are completely
incompatible, they only cause problems), but I was not able to
prevent the decision to embed WYSIWYG editors :(

>> I created a system where they can save their Word documents as "html"
>> and upload them to a certain directory, and the web page dynamically
>> runs them through tidylib...
>>
>> It alllllmost works. The resulting document follows the page's css
>> rules and displays correctly, except for those durned "smart quotes".
>
> There's nothing inherently wrong with the curly quotes; the problem with
> them is only that people fail to understand the character encoding
> issues properly. Word documents are saved in the Windows-1252 encoding
> by default. The quotes you are referring to are in the positions 145
> (‘), 146 (’), 147 (“) and 148 (”). However, these code points (and all
> others in the range from 128 to 159) are control codes in ISO-8859-1 and
> others. Thus, the main problem is only caused by declaring the
> incorrect character encoding.
>
> Although declaring the encoding as Windows-1252 in the HTTP headers will
> work, it is not recommended because Windows-1252 is a proprietary
> encoding designed for Windows only (support may have been added
> to other systems too, but that's not guaranteed).
>
> The best options are to either save the files as UTF-8 and declare that
> encoding in the HTTP headers, or continue to use ISO-8859-1 and replace
> the quotes (and other special Windows-1252 chars) with numeric character
> references. I think Word does have an option to save files as UTF-8,
> which I recommend.
>
> More information about Windows-1252 and the numeric character references
> is available at:
> http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

I had this problem myself often enough and I usually used a list of
str_replace expressions to turn these characters into the correct &#...;
counterparts. After reading Lachlan's comment an untested idea popped up in
my head: you could try using the iconv module of PHP to convert the
Windows-1252 into UTF-8 on the fly.
I have neither Word nor Windows available, so I can't test it now...
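
The replacement-list approach might look roughly like this in Python. It is a sketch, not the code from the thread: the table `CP1252_TO_NCR` and the function `replace_word_punctuation` are hypothetical names, and the table covers only a handful of the characters Word commonly inserts.

```python
# Hypothetical lookup table: Windows-1252 byte -> numeric character reference.
# The reference numbers are the Unicode code points, not the byte values.
CP1252_TO_NCR = {
    b"\x85": b"&#8230;",  # horizontal ellipsis
    b"\x91": b"&#8216;",  # left single quotation mark
    b"\x92": b"&#8217;",  # right single quotation mark
    b"\x93": b"&#8220;",  # left double quotation mark
    b"\x94": b"&#8221;",  # right double quotation mark
    b"\x96": b"&#8211;",  # en dash
    b"\x97": b"&#8212;",  # em dash
}


def replace_word_punctuation(raw: bytes) -> bytes:
    """Replace selected Windows-1252 bytes with numeric character references."""
    for byte, reference in CP1252_TO_NCR.items():
        raw = raw.replace(byte, reference)
    return raw


print(replace_word_punctuation(b"\x93hi\x94 \x96 it\x92s"))
```

A byte-level replacement like this is only safe when the input is a single-byte encoding. Run over a UTF-8 file it would corrupt multi-byte sequences, because bytes in the 0x80 to 0xBF range appear there as continuation bytes. Converting the whole document from Windows-1252 to UTF-8, as suggested with iconv, avoids maintaining such a table.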

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/


On Thu, 14 Apr 2005, Lachlan Hunt wrote:

> There's nothing inherently wrong with the curly quotes, the problem
> with them is only that people fail to understand the character
> encoding issues properly. Word documents are saved in the
> Windows-1252 encoding by default. The quotes you are referring to
> are in the positions 145 (), 146 (), 147 () and 148 ().

Thereby neatly presenting yet another demonstration of the problem ;-}

> However, these code points (and all others in the range from 128 to
> 159) are control codes in ISO-8859-1 and others. Thus, the main
> problem is only caused by declaring the incorrect character
> encoding.

agreed

> Although declaring the encoding as Windows-1252 in the HTTP headers
> will work, it is not recommended because Windows-1252 is a
> proprietary encoding designed for Windows only (support may
> have been added to other systems too, but that's not guaranteed).

in fact, support is pretty widespread, but I'd still counsel against
using it.

> The best options are to either save the files as UTF-8 and declare
> that encoding in the HTTP headers, or continue to use ISO-8859-1 and
> replace the quotes (and other special Windows-1252 chars) with
> numeric character references.

I just wanted to make sure that nobody reading this thought that you
meant character references such as &#145; etc.: funnily enough,
historically MS software seems to have generated those undefined
references more enthusiastically than the actual 8-bit characters, but
the undefined references are quite bogus from Unicode's point of view.

The correct Unicode
code points for these characters are all greater than 255, as you
obviously already know (there's a somewhat official table of them
with hex equivalents at
http://www.unicode.org/Public/MAPPIN...OWS/CP1252.TXT )
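
The claim that every correct code point lies above 255 can be checked with a short Python sketch. The function name `cp1252_reference_table` is hypothetical; the sketch relies on Python's built-in `cp1252` codec rather than the table linked above.

```python
def cp1252_reference_table() -> dict:
    """Map each defined Windows-1252 byte in 128-159 to the numeric
    character reference for its real Unicode code point."""
    table = {}
    for byte in range(128, 160):
        try:
            char = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Five byte values in this range are undefined in Windows-1252.
            continue
        table[byte] = f"&#{ord(char)};"
    return table


table = cp1252_reference_table()
for byte in (145, 146, 147, 148):
    print(f"byte {byte}: write {table[byte]}, not &#{byte};")
```

A reference such as `&#145;` names Unicode code point 145, which is a control character, so the byte value cannot simply be reused as the reference number.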
> I think Word does have an option to save files as UTF-8, which I
> recommend.

I guess it depends on what version you're using. The subject line
mentioned 2003, but plenty of folks aren't there yet.

> http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

good cite.

all the best

