Unicode characters in a Ruby script?

Question

I would like to write a Ruby script which writes Japanese characters to the console. For example:

puts "こんにちは・今日は"

However, I get an exception when running it:

jap.rb:1: Invalid char `\377' in expression
jap.rb:1: Invalid char `\376' in expression

Is it possible to do? I'm using Ruby 1.8.6.

Recommended answer

You've saved the file in the UTF-16LE encoding, the one Windows misleadingly calls "Unicode". This encoding is generally best avoided because it's not an ASCII-superset: each code unit is stored as two bytes, with ASCII characters having the other byte stored as \0. This will confuse an awful lot of software; it is unusual to use UTF-16 for file storage.

What you are seeing with \377 and \376 (octal for \xFF and \xFE) is the U+FEFF Byte Order Mark sequence put at the front of UTF-16 files to distinguish UTF-16LE from UTF-16BE.
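To make the byte order mark concrete, here is a minimal sketch that checks the first bytes of some file contents. It assumes a modern Ruby (2.0 or later, for `String#b`), not the 1.8.6 from the question, and `sniff_bom` is a hypothetical helper name, not something from Ruby itself.

```ruby
# Hypothetical helper (not part of Ruby): guess the encoding from the
# first bytes of a file's contents. UTF-32 BOMs are deliberately ignored
# to keep the sketch short.
def sniff_bom(bytes)
  head = bytes.byteslice(0, 3).to_s.b
  if head.start_with?("\xEF\xBB\xBF".b)
    "UTF-8 with BOM"
  elsif head.start_with?("\xFF\xFE".b)
    "UTF-16LE"
  elsif head.start_with?("\xFE\xFF".b)
    "UTF-16BE"
  else
    "no BOM found"
  end
end

# The two bytes from the error message, written in octal and in hex.
same_bytes = "\377\376".b == "\xFF\xFE".b
puts same_bytes

# A BOM followed by "pu" stored as UTF-16LE code units.
puts sniff_bom("\xFF\xFEp\x00u\x00".b)
```

On a real file you would pass in something like the first few bytes read in binary mode.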

Ruby 1.8 is totally byte-based; it makes no attempt to read Unicode characters from a script. So you can only save source files in ASCII-compatible encodings. Normally, you'd want to save your files as UTF-8 (without BOM; the UTF-8 faux-BOM is another great Microsoft innovation that breaks everything). This'd work great for scripts on the web producing UTF-8 pages.
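Re-saving from the editor is the usual fix, but the conversion itself can be sketched in a few lines. This assumes a modern Ruby (1.9 or later, with a UTF-8 source file), and it works on in-memory strings rather than real files so that it stays self-contained; the variable names are made up for illustration.

```ruby
# Simulate what the editor wrote: a BOM followed by UTF-16LE code units.
source = 'puts "こんにちは"'
utf16_bytes = ("\uFEFF" + source).encode("UTF-16LE").b

# Reinterpret the raw bytes as UTF-16LE, transcode to UTF-8,
# then drop the leading U+FEFF so no UTF-8 BOM is left behind.
utf8 = utf16_bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
utf8 = utf8.sub(/\A\uFEFF/, "")

puts utf8.encoding
puts utf8 == source
```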

And if you wanted to be sure the source code would be tolerant of being saved in any ASCII-compatible encoding, you could encode the string to make it more resilient (if less readable):

puts "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"
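If you don't fancy working out those escapes by hand, something like the following can generate them. It is only a sketch: it assumes a modern Ruby reading the script as UTF-8, and `hex_escape` is a hypothetical helper name.

```ruby
# Hypothetical helper: render every byte of a string as a \xNN escape.
def hex_escape(str)
  str.bytes.map { |byte| format('\x%02x', byte) }.join
end

greeting = "こんにちは・今日は"
escaped = hex_escape(greeting)

# Prints the escape sequence for the UTF-8 bytes of the greeting.
puts escaped
```

Paste the printed text between double quotes to get an ASCII-only string literal.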

However! Writing to the console is itself a big problem. The encoding used to send characters to the console varies from platform to platform. On Linux or OS X, it's UTF-8. On Windows, it's a different encoding for every installation locale (as selected under "Language for non-Unicode applications" in the "Regional and Language Options" control panel entry), but it's never UTF-8. This setting is, again misleadingly, known as the ANSI code page.
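To see what a given machine reports, a quick check like this may help. It relies on the `Encoding` class, so it assumes Ruby 1.9 or later and will not run on 1.8.6; the output differs from system to system, so none is shown.

```ruby
# What the operating system locale says the character set is.
locale_charmap = Encoding.locale_charmap

# What Ruby assumes for data read from or written to the outside world.
default_external = Encoding.default_external

puts "locale charmap:   #{locale_charmap}"
puts "default external: #{default_external}"
puts "STDOUT encoding:  #{STDOUT.external_encoding.inspect}"
```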

So if you are using a Japanese Windows install, your console encoding will be Windows code page 932 (a variant of Shift-JIS). If that's the case, you can save the text file from a text editor using "ANSI" or explicitly "Japanese cp932", and when you run it in Ruby you'll get the right characters out. Again, if you wanted to make the source withstand misencoding, you could escape the string in cp932 encoding:

puts "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"
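As a rough way to double-check those bytes, a modern Ruby can do the transcoding itself. This sketch assumes Ruby 1.9 or later with a UTF-8 source file, and uses `Windows-31J`, the name Ruby gives to code page 932.

```ruby
greeting = "こんにちは・今日は"

# Transcode from UTF-8 to code page 932 (called Windows-31J in Ruby).
cp932 = greeting.encode("Windows-31J")

# Show the resulting bytes in hex; 0x45 is the ASCII letter "E".
puts cp932.bytes.map { |byte| format("%02x", byte) }.join(" ")
```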

But if you run it on a machine in another locale, it'll produce different characters. You will be unable to write Japanese to the default console from Ruby on a Western Windows installation (code page 1252).
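The "different characters" effect can be simulated on any machine by reinterpreting the cp932 bytes as code page 1252. This is just an illustration in modern Ruby (1.9 or later, UTF-8 source file), not what Ruby 1.8 does internally; the `replace` options are there only in case a byte has no mapping.

```ruby
# The cp932 bytes a Japanese-locale script would send to the console.
cp932_bytes = "こんにちは".encode("Windows-31J").b

# Pretend the console reads those same bytes as Windows-1252 instead.
garbled = cp932_bytes.dup.force_encoding("Windows-1252")
                     .encode("UTF-8", invalid: :replace, undef: :replace)

puts garbled
```

Each two-byte Japanese character turns into two unrelated Western characters, which is the mojibake you would see on the wrong locale.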

(Whilst Ruby 1.9 improves Unicode handling a lot, it doesn't change anything here. It's still a bytes-based application using the C standard library IO functions, and that means it is limited to Windows's local code page.)
