Character encoding of Microsoft Word DOC and DOCX files?
I'm not too familiar with the encoding that Microsoft Word uses. If someone were to save a .doc or .docx file from Word, what is the standard encoding that is used?
I'm guessing it's not UTF-8, as the resulting text (pasted into a UTF-8 encoded text file) does not honour certain punctuation (e.g. quotes).
For example, an opening Word 'smart quote', when pasted into a UTF-8 text file, results in an ì symbol. If Word does indeed encode in UTF-8, then how does Word attempt to render the actual UTF-8 character?
Edit
After doing a little digging, I can see that a Microsoft Word .docx file is actually a compressed format. Unzipping it unpacks a number of .xml files.
However, the inability of a UTF-8 encoded text file to honour these 'smart' quotes is still perplexing. Any enlightening information would be helpful.
These days a docx file is really a bunch of compressed XML files. One of these files is document.xml, which starts with the following line (i.e. an XML prolog):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
As you can see, it is UTF-8 encoded.
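A minimal Python sketch of that check, using only the standard library. The helper names `read_xml_prolog` and `build_sample_docx` are made up for illustration, and the in-memory archive is a tiny stand-in rather than a complete Word file; the one detail taken from the real format is that the main document part lives at `word/document.xml` inside the archive.

```python
import io
import zipfile


def read_xml_prolog(docx_file):
    """Return the first line of word/document.xml inside a .docx archive."""
    with zipfile.ZipFile(docx_file) as archive:
        with archive.open("word/document.xml") as xml_file:
            # The prolog is plain ASCII, so decoding the first line is safe.
            return xml_file.readline().decode("ascii").strip()


def build_sample_docx():
    """Build a tiny in-memory stand-in for a .docx (hypothetical sample)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "word/document.xml",
            '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
            "<document/>\n",
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(read_xml_prolog(build_sample_docx()))
```

To try it on a real file, pass a path such as `read_xml_prolog("example.docx")`, where `example.docx` is a placeholder for a document of your own.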
EDIT
UTF-8 can encode the full set of Unicode characters. Just for the sake of completeness, that does not mean that every Unicode character can actually be used in an XML file. Even a CDATA block has its limitations. But having said all that, storing a ` or an ì isn't a problem.
And more importantly, the file format does not really have anything to do with copy-paste behavior of the application itself.
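As an illustration of how the symptom can arise outside the file format, here is a small Python sketch of one possible cause. It is an assumption, not something confirmed in this thread: if the pasted quote arrives as a single Windows-1252 byte and the receiving side reads that byte as Mac OS Roman, the result is exactly an ì. The helper name `misread` is hypothetical.

```python
def misread(text, written_as, read_as):
    """Encode text with one codec, then decode the same bytes with another."""
    return text.encode(written_as).decode(read_as)


OPENING_SMART_QUOTE = "\u201c"

if __name__ == "__main__":
    # Assumed scenario: Windows-1252 byte 0x93 interpreted as Mac OS Roman.
    print(misread(OPENING_SMART_QUOTE, "cp1252", "mac_roman"))
    # For comparison: the UTF-8 bytes E2 80 9C interpreted as Windows-1252.
    print(misread(OPENING_SMART_QUOTE, "utf-8", "cp1252"))
```

The first call yields ì, while the second yields the three-character sequence â€œ, so a lone ì suggests a single-byte encoding mismatch somewhere in the copy-paste path rather than a problem with UTF-8 itself.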
Nevertheless, here's how Word would store a ` and an ì symbol.
CORRECTION
A bit confusing, but I just realized that by "smart quote" you probably refer to the mechanism that Word has to represent the curly quotes. In my previous answer I thought you meant "backticks", which is a different thing. - Sorry for the confusion.
Well, anyway, here are the Unicode code points for these smart quotes:
‘ U+2018
’ U+2019
“ U+201C
” U+201D
Let's put them in a simple UTF-8 encoded text file. The result is not that spectacular:
U+2018 is encoded in UTF-8 as E2 80 98
U+2019 is encoded in UTF-8 as E2 80 99
U+201C is encoded in UTF-8 as E2 80 9C
U+201D is encoded in UTF-8 as E2 80 9D
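The same byte sequences can be reproduced with a few lines of Python. This is only a sketch; the names `SMART_QUOTES`, `utf8_hex`, and `describe` are chosen for illustration.

```python
# The four curly quote code points discussed above.
SMART_QUOTES = ["\u2018", "\u2019", "\u201c", "\u201d"]


def utf8_hex(char):
    """Return the UTF-8 bytes of a character as space-separated hex."""
    return " ".join(f"{byte:02X}" for byte in char.encode("utf-8"))


def describe(char):
    """Format one line per character: code point, then UTF-8 bytes."""
    return f"U+{ord(char):04X} is encoded in UTF-8 as {utf8_hex(char)}"


if __name__ == "__main__":
    for quote in SMART_QUOTES:
        print(describe(quote))
```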
So, I went one step further and put them in a Word file. I entered a line with regular quotes, and one with smart quotes.
" this is a test "
“ this is another test ”
Then I saved the file and looked at how it was stored in Word's XML structure. It is stored exactly as expected.
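A rough way to repeat that inspection in Python is sketched below. The sample XML is a hand-written, heavily simplified stand-in for what Word produces (a real document.xml carries many more attributes and parts), and the helper names are hypothetical. The namespace URI is the standard WordprocessingML one.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified, hypothetical document.xml with one line per quote style.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>" this is a test "</w:t></w:r></w:p>'
    "<w:p><w:r><w:t>\u201c this is another test \u201d</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def build_sample_docx():
    """Pack the sample XML into an in-memory zip, like a minimal .docx."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", SAMPLE_XML.encode("utf-8"))
    buffer.seek(0)
    return buffer


def read_document_xml(docx_file):
    """Return the raw bytes of word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read("word/document.xml")


def paragraph_texts(xml_bytes):
    """Collect the text of each w:p paragraph from its w:t runs."""
    root = ET.fromstring(xml_bytes)
    return [
        "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        for p in root.iter(f"{{{W_NS}}}p")
    ]


if __name__ == "__main__":
    raw = read_document_xml(build_sample_docx())
    print(b"\xe2\x80\x9c" in raw)  # are the UTF-8 bytes of U+201C present?
    print(paragraph_texts(raw))
```

Pointing `read_document_xml` at a real .docx path instead of the in-memory sample should show the same thing: the curly quotes sit in the XML as ordinary UTF-8 byte sequences.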