Character encoding of Microsoft Word DOC and DOCX files?

Question

I'm not too familiar with the encoding that Microsoft Word uses. If someone were to save a .doc or .docx file from Word, what is the standard encoding that is used?

I'm guessing it's not UTF-8, as the resulting text (pasted into a UTF-8 encoded text file) does not honour certain punctuation (e.g. quotes).

For example, an opening Word 'smart quote' when pasted in a UTF-8 text file, results in an ì symbol. If Word does indeed encode in UTF-8, then how does Word attempt to render the actual UTF-8 character?

Edit

After doing a little digging, I can see that a Microsoft Word .docx file is actually a compressed archive. Unzipping it unpacks a number of .xml files.

However, the inability of a UTF-8 encoded text file to honour these 'smart' quotes is still perplexing. Any enlightening information would be helpful.

Solution

These days a docx file is really a bunch of compressed XML files. One of these files is document.xml, which starts with the following line (i.e. an XML prolog):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

As you can see, it is UTF-8 encoded.
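As a rough way to check this yourself, the sketch below reads the prolog of one archive member and reports the encoding it declares. To stay self-contained it builds a tiny stand-in archive in memory instead of opening a real Word file; `SAMPLE_XML`, `build_sample_docx` and `declared_encoding` are names made up for this illustration, and the member path `word/document.xml` is where the main document part normally sits in a .docx.

```python
import io
import re
import zipfile
from typing import Optional

# Hand-written stand-in for the main document part (an assumption for this
# sketch); a file saved by Word contains far more markup than this.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t>hello</w:t></w:r></w:p></w:body></w:document>'
)


def build_sample_docx() -> bytes:
    """Return the bytes of a small ZIP archive laid out like a .docx."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", SAMPLE_XML.encode("utf-8"))
    return buf.getvalue()


def declared_encoding(docx_bytes: bytes,
                      member: str = "word/document.xml") -> Optional[str]:
    """Return the encoding named in the XML prolog of one archive member."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        head = zf.read(member)[:200]  # the prolog sits at the very start
    match = re.search(rb'encoding="([^"]+)"', head)
    return match.group(1).decode("ascii") if match else None


print(declared_encoding(build_sample_docx()))
```

For a real file, passing the bytes of the .docx (for example read with `open(path, "rb").read()`) to `declared_encoding` should work the same way.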

EDIT

UTF-8 supports the full set of Unicode characters. Just for the sake of completeness, that does not mean that every Unicode character can actually be used in an XML file. Even a CDATA block has its limitations. But having said all that, storing a ` or an ì isn't a problem.

And more importantly, the file format does not really have anything to do with copy-paste behavior of the application itself.
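One possible explanation for the stray ì, offered only as an assumption since the thread does not say which platform or editor was involved: the opening curly quote is the single byte 0x93 in Windows-1252, and that same byte means ì in Mac OS Roman. The sketch below reproduces that misreading, plus the more common case of UTF-8 bytes being read as Windows-1252.

```python
LEFT_DOUBLE_QUOTE = "\u201c"  # the opening curly double quote

# Case 1 (assumed scenario): the quote is written as Windows-1252 and the
# receiving program interprets the byte as Mac OS Roman.
cp1252_bytes = LEFT_DOUBLE_QUOTE.encode("cp1252")
misread_as_mac = cp1252_bytes.decode("mac_roman")

# Case 2: the quote is written as UTF-8 and the receiving program interprets
# the three bytes as Windows-1252.
utf8_bytes = LEFT_DOUBLE_QUOTE.encode("utf-8")
misread_as_cp1252 = utf8_bytes.decode("cp1252")

print(cp1252_bytes.hex(), repr(misread_as_mac))
print(utf8_bytes.hex(), repr(misread_as_cp1252))
```

In both cases the stored character is fine; the garbling comes from decoding the bytes with a different encoding than the one used to write them.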

Nevertheless, here's how Word would store a ` and an ì symbol.

CORRECTION

A bit confusing, but I just realized that by "smart quote" you probably mean the mechanism Word uses to produce curly quotes. In my previous answer I thought you meant backticks, which is a different thing. Sorry for the confusion.

Well, anyway, here are the Unicode code points for these smart quotes: U+2018, U+2019, U+201C and U+201D.

Let's put them in a simple UTF-8 encoded text file. The result is not that spectacular:

  • U+2018 is encoded in UTF-8 as E2 80 98
  • U+2019 is encoded in UTF-8 as E2 80 99
  • U+201C is encoded in UTF-8 as E2 80 9C
  • U+201D is encoded in UTF-8 as E2 80 9D
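The byte sequences listed above can be reproduced with a few lines of Python. This is only a quick check; `SMART_QUOTES` and `utf8_hex` are names chosen for the illustration.

```python
# The four curly-quote code points discussed above.
SMART_QUOTES = ["\u2018", "\u2019", "\u201c", "\u201d"]


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


for ch in SMART_QUOTES:
    print(f"U+{ord(ch):04X} is encoded in UTF-8 as {utf8_hex(ch)}")
```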

So I went one step further and put them in a Word file. I entered one line with regular quotes and one with smart quotes.

" this is a test " 
“ this is another test ”

Then I saved the file and looked at how the text was stored in Word's XML structure. It is stored exactly as expected.
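To illustrate what "as expected" means here, the following sketch parses a hand-written fragment shaped like a document part and pulls the text back out of its `w:t` elements. The fragment is an assumption made for the example, not the XML Word actually produced for this test, and `paragraph_texts` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import List

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written stand-in: one paragraph with straight quotes, one with curly
# quotes, serialized as UTF-8 bytes.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>" this is a test "</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>\u201c this is another test \u201d</w:t></w:r></w:p>'
    '</w:body></w:document>'
).encode("utf-8")


def paragraph_texts(xml_bytes: bytes) -> List[str]:
    """Return the concatenated w:t text of each w:p paragraph."""
    root = ET.fromstring(xml_bytes)
    texts = []
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = para.iter(f"{{{W_NS}}}t")
        texts.append("".join(t.text or "" for t in runs))
    return texts


for line in paragraph_texts(SAMPLE_XML):
    print(line)

# The curly quotes are present in the raw bytes as plain UTF-8 sequences.
print(b"\xe2\x80\x9c" in SAMPLE_XML, b"\xe2\x80\x9d" in SAMPLE_XML)
```

The point of the sketch is that the curly quotes need no escaping or special markup: they are ordinary characters inside the text element, written as their UTF-8 bytes.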
