How does GHC/Haskell decide what character encoding it's going to decode/encode from/to?
Question
It seems that GHC is at least inconsistent in the character encoding it decides to decode from.
Consider a file, omatase-shimashita.txt, with the following content, encoded in UTF-8: お待たせしました
readFile seems to read this in properly...
Prelude> content <- readFile "/home/chris/Desktop/omatase-shimashita.txt"
Prelude> length content
8
Prelude> putStrLn content
お待たせしました
However, if I write a simple "echo" server, it does not decode with a default of UTF-8. Consider the following code that handles an incoming client:
handleClient handle = do
    line <- hGetLine handle
    putStrLn $ "Read following line: " ++ toString line
    handleClient handle
And the relevant client code, explicitly sending UTF-8:
Data.ByteString.hPutStrLn handle $ Codec.Binary.UTF8.Generic.fromString "お待たせしました"
Is this not inconsistent behavior? Is there any method to this madness? I am planning to rewrite my application(s) to explicitly use ByteString objects and explicitly encode and decode using Codec.Binary.UTF8, but it would be good to know what's going on here anyway... :o/
UPDATE: I am running on Ubuntu Linux, version 10.10, with a locale of en_US.UTF-8...
$ cat /etc/default/locale
LANG="en_US.UTF-8"
$ echo $LANG
en_US.UTF-8
Recommended answer
Which version of GHC are you using? Older versions in particular did not handle Unicode I/O very well.
This section in the GHC documentation describes how to change input/output encodings:
http://haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/System-IO.html#23
Also, the documentation says this:
A text-mode Handle has an associated TextEncoding, which is used to decode bytes into Unicode characters when reading, and encode Unicode characters into bytes when writing.
The default TextEncoding is the same as the default encoding on your system, which is also available as localeEncoding. (GHC note: on Windows, we currently do not support double-byte encodings; if the console's code page is unsupported, then localeEncoding will be latin1.)
Encoding and decoding errors are always detected and reported, except during lazy I/O (hGetContents, getContents, and readFile), where a decoding error merely results in termination of the character stream, as with other I/O errors.
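To see which defaults the passage above is talking about on a given machine, something like the following may help. It is only a minimal sketch using System.IO from base: it prints localeEncoding and the TextEncoding currently attached to stdin and stdout. The helper name describeEncoding is made up for this example, and the output depends entirely on your locale, so none is shown.

```haskell
import System.IO

-- Hypothetical helper: name the encoding attached to a handle.
-- hGetEncoding returns Nothing for a handle in binary mode.
describeEncoding :: Handle -> IO String
describeEncoding h = maybe "binary (no encoding)" show <$> hGetEncoding h

main :: IO ()
main = do
  -- The encoding GHC derived from the current locale.
  putStrLn $ "localeEncoding: " ++ show localeEncoding
  stdinEnc <- describeEncoding stdin
  putStrLn $ "stdin:  " ++ stdinEnc
  stdoutEnc <- describeEncoding stdout
  putStrLn $ "stdout: " ++ stdoutEnc
```

Calling describeEncoding on the socket handle in your server should tell you whether it really carries the encoding you expect.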
Maybe this has something to do with your problem? If GHC has defaulted to something other than UTF-8 somewhere, or your handle has been manually set to use a different encoding, that might explain the problem. If you're just trying to echo text at the console, then some kind of console code-page funniness is probably going on. I know I've had similar problems in the past with other languages like Python when printing Unicode in a Windows console.
Try running hSetEncoding handle utf8 and see if it fixes your problem.
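Here is a rough sketch of what hSetEncoding changes, using a temporary file in place of your socket handle so that it runs without a network library. The readLineWith helper and the temp file name are invented for this example, and System.Directory comes from the directory package that ships with GHC rather than from base. The same UTF-8 bytes are read back twice, once decoded as UTF-8 and once as latin1.

```haskell
import Control.Exception (bracket)
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Hypothetical helper: read the first line of a file after forcing
-- the given encoding on the handle.
readLineWith :: TextEncoding -> FilePath -> IO String
readLineWith enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc  -- overrides the locale-derived default
    hGetLine h

main :: IO ()
main = do
  tmpDir <- getTemporaryDirectory
  bracket (openTempFile tmpDir "encoding-demo.txt") (removeFile . fst) $ \(path, h) -> do
    -- Write the sample text as UTF-8 bytes, whatever the locale is.
    hSetEncoding h utf8
    hPutStrLn h "お待たせしました"
    hClose h
    asUtf8   <- readLineWith utf8 path
    asLatin1 <- readLineWith latin1 path
    print (length asUtf8)    -- 8: one Char per character
    print (length asLatin1)  -- 24: one Char per UTF-8 byte
```

If your server shows the second kind of result (three garbage characters for every Japanese character), the handle is most likely decoding as a single-byte encoding, and setting utf8 on it right after accepting the connection would be the first thing to try.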