How does vbscript filesystemobject encode characters?

Question

I have this vbscript code:

    Set fs = CreateObject("Scripting.FileSystemObject")
    Set ts = fs.OpenTextFile("tmp.txt", 2, True)

    for i = 128 to 255
        s = chr(i)
        if lenb(s) <>2 then
            wscript.echo i
            wscript.quit
        end if
        ts.write s
    next
    ts.close

On my system, each integer is converted to a double byte character: there are no numbers in that range that cannot be represented by a character, and no number requires more than 2 bytes. But when I look at the file, I find only 127 bytes.

This answer: https://stackoverflow.com/a/31436726/1335492 suggests that the FSO creates UTF files and inserts a BOM. But the file contains only 127 bytes, and no Byte Order Mark.

How does FSO decide how to encode text? What encoding allows 8 bit single-byte characters? What encodings do not include 255 8 bit single-byte characters?

(Answers about how FSO reads characters may also be interesting, but that's not what I'm specifically asking here)

I've limited my question to the high-bit characters, to make it clear what the question is. (Answers about the low-bit characters may also be interesting, but that's not what I'm specifically asking here)

Recommended answer

Short Answer:

The file system object maps "Unicode" to "ASCII" using the code page associated with the System Locale. (Chr and ChrW use the User Locale.)

There may be silent transposition errors between the System code page and the Thread (user) code page. There may also be coding and decoding errors if code points are missing from a code page, or, as with Japanese and UTF-8, the code pages contain multi-byte characters.

VBScript provides no native method to detect the User, Thread, or System code page. The Thread (user) code page may be inferred from the Locale set by SetLocale or returned by GetLocale (there is a list here: https://www.science.co.il/language/Locale-codes.php), but there does not appear to be any MS documentation. On Win2K+, WMI may be used to query the System code page. The CHCP command queries and changes the OEM code page, which is neither the User nor the System code page.
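The values mentioned above can be inspected with a short script. This is only a sketch: it assumes WMI is available, and it assumes that the `CodeSet` property of `Win32_OperatingSystem` is the value this answer calls the System code page. Run it with cscript so the output goes to the console.

```vbscript
' Sketch: report the locale-related values discussed above.
' Assumption: Win32_OperatingSystem.CodeSet reflects the System code page.
Option Explicit
Dim wmi, os, sh

' Thread (user) locale as an LCID, for example 1033 for en-US.
WScript.Echo "Thread locale (LCID): " & GetLocale()

' System code page and system locale as reported by WMI.
Set wmi = GetObject("winmgmts:\\.\root\cimv2")
For Each os In wmi.ExecQuery("SELECT CodeSet, Locale FROM Win32_OperatingSystem")
    WScript.Echo "System code page (CodeSet): " & os.CodeSet
    WScript.Echo "System locale (hex LCID): " & os.Locale
Next

' OEM (console) code page, as printed by the CHCP command.
Set sh = CreateObject("WScript.Shell")
WScript.Echo sh.Exec("cmd /c chcp").StdOut.ReadAll()
```

If the thread locale and the system code page belong to different languages, the transposition problems described in this answer become possible.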

The system code page may be spoofed by an application manifest. There is no way for an application (such as cscript or wscript) or script (such as VBScript or JScript) to change its parent system except by creating a new process with a new manifest, or by rebooting the system after making a registry change.

    s = chr(i)
    'creates a Unicode string, using the Thread Locale Codepage.

Code points that do not exist as characters are mapped as control characters: 127 becomes U+007F (which is a standard Unicode control character), 128 becomes U+20AC (the Euro symbol), and 129 becomes U+0081 (which is a code point in a Unicode control character region). In VBScript, the Thread Locale can be set and read by SetLocale and GetLocale.
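To see what Chr(i) produces on a particular machine, the resulting UTF-16 code unit can be printed with AscW. This is a minimal sketch; the chosen input values are only illustrative, and the results depend on the current thread locale (the values quoted above are the ones this answer reports).

```vbscript
' Sketch: show which UTF-16 code unit Chr(i) produces under the current thread locale.
Option Explicit
Dim i, c

For Each i In Array(127, 128, 129, 224)
    c = Chr(i)
    ' AscW returns a signed 16-bit value; mask it so it prints as 0000..FFFF.
    WScript.Echo i & " -> U+" & Right("0000" & Hex(AscW(c) And &HFFFF&), 4)
Next
```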

    createobject("Scripting.FileSystemObject").OpenTextFile(strOutFile, 2, True).write s
    'creates a 'code page' string, using the System Locale Codepage.

There are two ways that Windows can handle Unicode values it can't map: it can either map to a default character, or return an error. "Scripting.FileSystemObject" uses the error setting, and throws an exception.
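The failure mode can be observed by writing a character that the System code page is unlikely to contain. The sketch below uses hypothetical file names and compares the default text format with the Unicode format selected by the fourth argument of OpenTextFile (-1, TristateTrue). On a system whose code page does contain the character, the first write will simply succeed.

```vbscript
' Sketch: default (code page) text stream versus a Unicode text stream.
' The file names are hypothetical.
Option Explicit
Const ForWriting = 2
Const TristateTrue = -1   ' open the file as Unicode (UTF-16)
Dim fs, ts, s

Set fs = CreateObject("Scripting.FileSystemObject")
s = ChrW(&H4581)          ' the ideograph mentioned later in this answer

' Default format: the text is converted through the System code page.
Set ts = fs.OpenTextFile("tmp_ansi.txt", ForWriting, True)
On Error Resume Next
ts.Write s
If Err.Number <> 0 Then
    WScript.Echo "Code page write failed: " & Err.Number & " " & Err.Description
    Err.Clear
End If
On Error GoTo 0
ts.Close

' Unicode format: no code page conversion is needed.
Set ts = fs.OpenTextFile("tmp_unicode.txt", ForWriting, True, TristateTrue)
ts.Write s
ts.Close
WScript.Echo "Unicode file size: " & fs.GetFile("tmp_unicode.txt").Size & " bytes"
```

As far as I know, the Unicode format is the only case in which FSO writes a byte order mark, which would explain why the file in the question has none.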

The Thread Locale is, by default, the User Locale, which is the date and time format setting in the "Region and Language" control panel applet (called different things in different versions of Windows). It has an associated code page. According to MS internationalization expert Michka (Michael Kaplan, RIP), the reason it has a code page is so that months and days of the week can be written in appropriate characters, and it should not be used for any other purpose.

The ASP-classic people clearly had other ideas, since Response.CodePage is thread-locale, and can be controlled by vbscript GetLocale and SetLocale amongst other methods. If the User Locale is changed, all processes are notified, and any thread that is using the default value updates. (I haven't tested what happens to a thread currently using a non-default value).

The System Locale is also called "Language for non-Unicode programs" and is also found in the "Region and Language" applet, but requires a reboot to change. This is the value used internally by Windows ("The System") to map between the "A" API and the "W" API. Changing this has no effect on the language of the Windows GUI (which is not a "non-Unicode program").

Assuming that the "Time and Date" setting matches the "Language for non-Unicode programs", any Chr(i) that can create a valid Unicode code point (see "mapping errors" below) will map back exactly from Unicode to "code page". Note that this does work for code points that are "control characters". Note also that it doesn't work the other way: Unicode to code page to Unicode doesn't always round-trip exactly. Famously, where Unicode defines more than one way of constructing a language character representation, (Character, Modifier) to code page to (Complex Character) does not round-trip correctly.

If the "Time and Date" does not match the "Language for non-Unicode programs", any translation could take place, for example U+0101 is 0xE0 on cp28594 and 0xE2 on cp28603: Chr(224) would go through U+0101 to be written as 226.
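The two byte values quoted above can be checked by encoding U+0101 under each code page. VBScript has no built-in way to pick an arbitrary code page, so this sketch swaps in ADODB.Stream, which is not part of the original answer. It assumes that ADODB.Stream is installed and that the charset names "iso-8859-4" and "iso-8859-13" correspond to code pages 28594 and 28603. The helper name EncodeFirstByte is hypothetical.

```vbscript
' Sketch: encode one Unicode character under two different code pages.
' Assumption: ADODB.Stream is available and recognises these charset names.
Option Explicit
Const adTypeBinary = 1
Const adTypeText = 2

' Hypothetical helper: returns the first byte of s encoded in charsetName.
Function EncodeFirstByte(s, charsetName)
    Dim stm, bytes
    Set stm = CreateObject("ADODB.Stream")
    stm.Type = adTypeText
    stm.Charset = charsetName
    stm.Open
    stm.WriteText s
    ' Rewind and switch to binary mode to read back the encoded bytes.
    stm.Position = 0
    stm.Type = adTypeBinary
    bytes = stm.Read
    stm.Close
    EncodeFirstByte = AscB(MidB(bytes, 1, 1))
End Function

WScript.Echo "iso-8859-4:  0x" & Hex(EncodeFirstByte(ChrW(&H101), "iso-8859-4"))
WScript.Echo "iso-8859-13: 0x" & Hex(EncodeFirstByte(ChrW(&H101), "iso-8859-13"))
```

If the two lines print different byte values, a string created under one code page and written under the other is silently transposed in the way described above.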

Even if there are no transposition errors, if the "Time and Date" does not match the "Language for non-Unicode programs" the program may fail when translating to the System Locale: if the Unicode code point does not have a matching Code Page code point, there will be an exception from the FileSystemObject.

There may also be mapping errors at Chr(i), going from code page to Unicode. Locale 1041 (Japanese) uses a double-byte code page (code page 932, Shift JIS). 0x81 is (only) the first byte of a double-byte pair. To be consistent with other code pages, 0x81 should map to the control character U+0081, but when given 0x81 under the Japanese code page, Windows assumes that the next byte in the buffer, or in the BSTR, is the second byte of the double-byte pair (I've not determined if the mistake is made before or after the conversion). Chr(&H81) is mapped to U+xx81 (81,xx). When I did it, I got U+4581, which is a CJK Unified Ideograph (Brasenia purpurca): it's not mapped by the Japanese code page.

Mapping errors at Chr(i) do not cause VBScript exceptions at the point of creation. If the UTF-16 code point created is invalid or not on the System Locale code page, there will be a FileSystemObject exception at .write. This particular problem can be avoided by using ChrW(i) instead of Chr(i). Under the Japanese locale (1041), ChrW(129) becomes the Unicode control character U+0081 instead of U+xx81.
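The difference between Chr and ChrW can be compared directly by switching the thread locale to Japanese. This is a sketch that assumes locale 1041 is available on the machine. The result of Chr(&H81) is not guaranteed, so the call is guarded in case it raises an error.

```vbscript
' Sketch: compare Chr and ChrW for the same number under a Japanese thread locale.
Option Explicit
Dim oldLocale, c

oldLocale = SetLocale(1041)   ' 1041 = Japanese; SetLocale returns the previous locale

On Error Resume Next          ' Chr may fail or return an unexpected character here
c = Chr(&H81)
If Err.Number = 0 Then
    WScript.Echo "Chr(&H81)  -> U+" & Right("0000" & Hex(AscW(c) And &HFFFF&), 4)
Else
    WScript.Echo "Chr(&H81) failed: " & Err.Description
    Err.Clear
End If
On Error GoTo 0

' ChrW takes a UTF-16 code unit directly, so no code page is involved.
WScript.Echo "ChrW(&H81) -> U+" & Right("0000" & Hex(AscW(ChrW(&H81)) And &HFFFF&), 4)

SetLocale oldLocale           ' restore the original thread locale
```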

A program can map between Unicode and "codepage" using any installed code page: the Windows functions MultiByteToWideChar and WideCharToMultiByte take [UINT CodePage] as the first parameter. That mechanism is used internally in Windows to map the "A" API to the "W" API, for example GetAddressByNameA and GetAddressByNameW. Windows is "W", (wide, 16 bit) internally, and "A" strings are mapped to "W" strings on call, and back from "W" to "A" on return. When Windows does the mapping, it uses the code page associated with the "System Locale", also called "Language for non-Unicode programs".

The Windows API function WriteFile writes bytes, not characters, so it's not an "A" or "W" function. Any program that uses it has to handle conversion between strings and bytes. The C function fwrite writes characters, so it can handle 16 bit characters, but it has no way of handling variable length encodings like UTF-8 or UTF-16: again, any program that uses "fwrite" has to handle conversion between strings and words.

The C++ function fwrite can handle UTF, and the compiler function _fwrite does magic that depends on the compiler. Presumably, on Windows, if code page translation is required the MultiByteToWideChar and WideCharToMultiByte API is used.

The "A" code pages and the "A" API were called "ANSI" or "ASCII" or "OEM", and started out as 8 bit characters, then grew to double-byte characters, and have now grown to UTF-8 (1 to 4 bytes per code point). The "W" API started out as 16 bit characters, then grew to UTF-16 (one or two 16 bit words per code point). Both are multi-word character encodings: the distinction is that for the "A" API and code pages, the word length is 8 bits, while for the "W" API and UTF-16, the word length is 16 bits. Because they are both multi-byte mappings, and because "byte" and "word" and "char" and "character" mean different things in different contexts, and because "W" and particularly "A" mean different things than they did years ago, I've just used "A" and "W" and "code page" and "Unicode".

"OEM" is the code page associated with another locale: The Console I/O API. It is per-process (it's a thread locale), it can be changed dynamically (using the CHCP command) and its default value is set at installation: there is no GUI provided to change the value stored in the registry. Most console programs don't use the console I/O API, and as written, use either the system locale, or the user locale, or, (sometimes inadvertently), a mixture of both.

The System Locale can be spoofed by using a manifest and there was a WinXP utility called "AppLocale" that did the same thing.
