Convert non-Western languages to HTML from Word


Problem description


I really need some help with language conversion to HTML. My
translators are translating into Word and I need to convert Word to
HTML. It's been a while since I've worked with Unicode, and I know that
not all the fonts being used are Unicode. Is there a way to strip all
the junk out of the MS Word filtered (yeah, right) HTML? All I want are
the basic formatting tags: no spans, fonts, divs, or CSS. But I don't want
to lose any language identification in the doctype or meta tags, or any
directionality. Any help would be greatly appreciated. This is for a
highly ranked (Google) non-profit site.
Thanks

Solution

On 2008-01-19, annalisa <an*******@yahoo.com> wrote:

> I really need some help with language conversion to HTML. My
> translators are translating into Word and I need to convert Word to
> HTML. It's been a while since I've worked with Unicode and know that
> not all the fonts being used are Unicode.

Never mind the fonts. What you want out of the Word docs is the
characters. You need to figure out how Word has encoded the output and
then probably transcode it to UTF-8 (you don't have to use UTF-8, but
it's simpler).

A good transcoding program is "iconv".
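
As a rough Python counterpart to an iconv run, the sketch below decodes bytes from a source encoding and re-encodes them as UTF-8. The function name transcode_to_utf8 and the cp1256 (Windows Arabic) sample are assumptions for illustration only; the thread does not say which encoding Word actually wrote, so that has to be read from the saved file.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    # The default strict error handling raises UnicodeDecodeError on bytes
    # that are invalid in source_encoding instead of silently dropping them.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical sample: four Arabic letters (seen, lam, alef, meem),
# first encoded the way a cp1256 file would store them.
original_text = "\u0633\u0644\u0627\u0645"
legacy_bytes = original_text.encode("cp1256")
utf8_bytes = transcode_to_utf8(legacy_bytes, "cp1256")
print(utf8_bytes.decode("utf-8") == original_text)
```

A single-byte encoding accepts almost any byte, so a wrong guess usually yields wrong characters rather than an error. It is worth checking a few known words in the output by eye.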

> Is there a way to strip all the junk out of the MS Word filtered (yeah
> right) HTML.

I am lucky enough not to be speaking from experience of having had to do
that, but I would start with Python and BeautifulSoup.
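
One way such a cleanup could look is sketched below. It swaps BeautifulSoup for the standard library's html.parser so that it runs without third-party installs. The tag whitelist, the class name WordHtmlCleaner, and the rule of keeping a bare span only when it carries lang or dir are all assumptions for illustration, not a tested recipe for Word's output.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist of "basic formatting" tags; adjust to taste.
KEEP = {"html", "head", "title", "meta", "body", "p", "br", "b", "i", "em",
        "strong", "u", "sub", "sup", "h1", "h2", "h3", "h4", "h5", "h6",
        "ul", "ol", "li", "blockquote", "a", "table", "tr", "th", "td"}
VOID = {"br", "meta"}                # kept tags that never take an end tag
SKIP_CONTENT = {"style", "script"}   # dropped together with their contents
LANG_ATTRS = {"lang", "dir"}


def render_attrs(attrs):
    parts = []
    for name, value in attrs:
        parts.append(' {}="{}"'.format(name, escape(value or "", quote=True)))
    return "".join(parts)


class WordHtmlCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0
        self.open_tags = []  # stack of (source tag, emitted tag or None)

    def handle_decl(self, decl):
        self.out.append("<!{}>".format(decl))  # keeps the doctype

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag == "meta":
            kept = attrs  # meta tags are passed through unchanged
        else:
            kept = [(k, v) for k, v in attrs if k in LANG_ATTRS or k == "href"]
        if tag in KEEP:
            emitted = tag
        elif any(k in LANG_ATTRS for k, _ in kept):
            # A dropped wrapper that carries lang/dir survives as a bare span.
            emitted = "span"
            kept = [(k, v) for k, v in kept if k in LANG_ATTRS]
        else:
            emitted = None
        if emitted:
            self.out.append("<{}{}>".format(emitted, render_attrs(kept)))
        if tag not in VOID:
            self.open_tags.append((tag, emitted))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or not any(t == tag for t, _ in self.open_tags):
            return
        # Close everything opened since the matching start tag.
        while self.open_tags:
            source, emitted = self.open_tags.pop()
            if emitted:
                self.out.append("</{}>".format(emitted))
            if source == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_word_html(markup: str) -> str:
    cleaner = WordHtmlCleaner()
    cleaner.feed(markup)
    cleaner.close()
    return "".join(cleaner.out)


sample = ('<!DOCTYPE html><html lang="ar" dir="rtl"><head>'
          '<meta charset="utf-8"><style>p.MsoNormal{margin:0}</style></head>'
          '<body><div class="WordSection1"><p class="MsoNormal" style="margin:0">'
          '<span style="font-family:Arial"><b>abc</b></span> '
          '<span lang="en" dir="ltr" style="color:red">def</span></p></div>'
          '</body></html>')
cleaned = clean_word_html(sample)
print(cleaned)
```

Comments, including Word's conditional comments, are dropped because the parser's default comment handler ignores them. The doctype, meta tags, and every lang or dir attribute pass through.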

> All I want are
> the basic formatting tags no spans,fonts, divs, css, but I don't want
> to lose any language identification in the doctype or metatags or
> directionality.

Directionality should just work: the characters are stored from "start"
to "end" and it's up to the browser to lay them out right-to-left or
left-to-right where appropriate.

An interesting question, though, is whether your authors have used special
characters like RLO and RLE and, if they have, whether Word will save
them out as the Unicode characters.

Then you have to decide whether to leave them in the output, or to
replace them with the equivalent unicode-bidi properties. I don't know
which has better browser support.
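
Whether the saved text contains such characters can be checked directly. The sketch below scans a string and reports each bidi control it finds. The function name find_bidi_controls is hypothetical, and the table assumes the relevant controls are LRM, RLM, LRE, RLE, PDF, LRO, and RLO.

```python
import unicodedata

# Assumed set of bidi controls to look for, keyed by code point.
BIDI_CONTROLS = {
    "\u200e": "LRM", "\u200f": "RLM",
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
}


def find_bidi_controls(text):
    """Return (index, abbreviation, Unicode name) for each control found."""
    return [(i, BIDI_CONTROLS[ch], unicodedata.name(ch))
            for i, ch in enumerate(text) if ch in BIDI_CONTROLS]


# Hypothetical sample: an RLE ... PDF pair wrapped around one letter.
for hit in find_bidi_controls("a\u202bb\u202c"):
    print(hit)
```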


On Sat, 19 Jan 2008, Ben C wrote:

> Directionality should just work: the characters are stored from "start"
> to "end" and it's up to the browser to lay them out right-to-left or
> left-to-right where appropriate.

Directionality doesn't "just work". On the contrary, the bidirectional
algorithm refers to seven control or formatting characters and explains
how to use them. In HTML, however, you should replace them with DIR
markup. Read more at
* http://www.unics.uni-hannover.de/nht.../if.tut.sc.www a??

> An interesting question, though, is whether your authors have used special
> characters like RLO and RLE and, if they have, whether Word will save
> them out as the Unicode characters.
> Then you have to decide whether to leave them in the output, or to
> replace them with the equivalent unicode-bidi properties.

By "unicode-bidi properties", do you mean "CSS properties"?
Normally, you should prefer HTML markup (DIR attribute) to
CSS properties and to Unicode control characters.
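
A minimal sketch of that replacement, assuming the usual correspondence of LRE/RLE to a span with a dir attribute and LRO/RLO to a bdo element, with PDF closing the most recent one. The function name bidi_controls_to_markup is hypothetical, and the input is assumed to be the plain text of a single paragraph, not markup.

```python
from html import escape

# Assumed mapping from bidi control characters to HTML elements.
OPENERS = {
    "\u202a": ("span", "ltr"),  # LRE
    "\u202b": ("span", "rtl"),  # RLE
    "\u202d": ("bdo", "ltr"),   # LRO
    "\u202e": ("bdo", "rtl"),   # RLO
}
PDF = "\u202c"


def bidi_controls_to_markup(paragraph: str) -> str:
    """Turn embedding/override controls in plain text into dir markup."""
    out, stack = [], []
    for ch in paragraph:
        if ch in OPENERS:
            tag, direction = OPENERS[ch]
            out.append('<{} dir="{}">'.format(tag, direction))
            stack.append(tag)
        elif ch == PDF:
            if stack:  # an unmatched PDF is simply dropped
                out.append("</{}>".format(stack.pop()))
        else:
            out.append(escape(ch, quote=False))
    while stack:  # close anything still open at the end of the paragraph
        out.append("</{}>".format(stack.pop()))
    return "".join(out)


print(bidi_controls_to_markup("a \u202bXY\u202c & b"))
```

LRM and RLM are left untouched here, since they are marks and do not open a range that needs closing.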

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell


On Sun, 20 Jan 2008, Jukka K. Korpela wrote:

>> An interesting question, though, is whether your authors have used
>> special characters like RLO and RLE and, if they have, whether Word
>> will save them out as the Unicode characters.
>
> That might be a problem... but in my test, RLO doesn't seem to work
> even in Word,

You must first install
http://www.microsoft.com/globaldev/h...pintlsupp.mspx
http://www.microsoft.com/globaldev/h...kintlsupp.mspx

> so why would an author use it?

Authors should avoid these Unicode characters in HTML:
http://www.unics.uni-hannover.de/nht...l-text#control
You could use character references like &#8235; . Check at
http://www.unics.uni-hannover.de/nht...t-to-left.html
whether they work in your browser.
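
If the control characters are kept at all, writing them as numeric character references at least makes them visible in the HTML source. A small sketch, with the hypothetical name controls_to_char_refs and an assumed set of seven controls:

```python
# Assumed set: LRM, RLM, LRE, RLE, PDF, LRO, RLO.
BIDI_CONTROL_CHARS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"


def controls_to_char_refs(text: str) -> str:
    """Replace invisible bidi controls with decimal character references."""
    return "".join("&#{};".format(ord(ch)) if ch in BIDI_CONTROL_CHARS else ch
                   for ch in text)


print(controls_to_char_refs("a\u202bb\u202c"))
```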

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell

