How does GHC/Haskell decide what character encoding it's going to decode/encode from/to?

Question

It seems that GHC is at least inconsistent in the character encoding it decides to decode from.

Consider a file, omatase-shimashita.txt, with the following content, encoded in UTF-8: お待たせしました

readFile seems to read this in properly...

Prelude> content <- readFile "/home/chris/Desktop/omatase-shimashita.txt"
Prelude> length content
8
Prelude> putStrLn content
お待たせしました

However, if I write a simple "echo" server, it does not decode with a default of UTF-8. Consider the following code that handles an incoming client:

handleClient handle = do
  line <- hGetLine handle
  putStrLn $ "Read following line: " ++ toString line
  handleClient handle

And the relevant client code, explicitly sending UTF-8:

Data.ByteString.hPutStrLn handle $ Codec.Binary.UTF8.Generic.fromString "お待たせしました"

Is this not inconsistent behavior? Is there any method to this madness? I am planning to rewrite my application(s) to explicitly use ByteString objects and explicitly encode and decode using Codec.Binary.UTF8, but it would be good to know what's going on here anyway... :o/

UPDATE: I am running on Ubuntu Linux, version 10.10, with a locale of en_US.UTF-8...

$ cat /etc/default/locale 
LANG="en_US.UTF-8"
$ echo $LANG 
en_US.UTF-8

Recommended answer

Which version of GHC are you using? Older versions especially didn't handle Unicode I/O very well.

This section in the GHC documentation describes how to change input/output encodings:

http://haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/System-IO.html#23

Also, the documentation says this:

A text-mode Handle has an associated TextEncoding, which is used to decode bytes into Unicode characters when reading, and encode Unicode characters into bytes when writing.

The default TextEncoding is the same as the default encoding on your system, which is also available as localeEncoding. (GHC note: on Windows, we currently do not support double-byte encodings; if the console's code page is unsupported, then localeEncoding will be latin1.)

Encoding and decoding errors are always detected and reported, except during lazy I/O (hGetContents, getContents, and readFile), where a decoding error merely results in termination of the character stream, as with other I/O errors.

Maybe this has something to do with your problem? If GHC has defaulted to something other than UTF-8 somewhere, or your handle has been manually set to use a different encoding, that might explain the problem. If you're just trying to echo text at the console, then some kind of console code-page funniness is probably going on. I know I've had similar problems in the past with other languages like Python when printing Unicode in a Windows console.
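
One way to check what GHC has actually picked is to ask the handles themselves. The following is a minimal sketch, assuming a GHC recent enough that TextEncoding has a Show instance; describeHandle is a hypothetical helper name, not something from the question. It prints localeEncoding and the encoding attached to each standard handle, and the same helper could be pointed at the handle the server reads from:

```haskell
import System.IO

-- Hypothetical helper: render the encoding currently attached to a handle.
-- hGetEncoding returns Nothing for a handle in binary mode.
describeHandle :: String -> Handle -> IO String
describeHandle name h = do
  enc <- hGetEncoding h
  return (name ++ ": " ++ maybe "binary mode (no encoding)" show enc)

main :: IO ()
main = do
  -- The encoding GHC derived from the system locale.
  putStrLn ("localeEncoding: " ++ show localeEncoding)
  -- The encodings of the three standard handles.
  mapM_ (\(name, h) -> describeHandle name h >>= putStrLn)
        [("stdin", stdin), ("stdout", stdout), ("stderr", stderr)]
```

If a handle reports something other than UTF-8 here, that would be consistent with the mismatch described in the question.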

Try running hSetEncoding handle utf8 and see if it fixes your problem.
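
To illustrate the effect of hSetEncoding, here is a small sketch that does not involve sockets at all: it writes the sample line to a temporary file as UTF-8 and reads it back twice, once decoding as UTF-8 and once as Latin-1. The names readLineWith and roundTrip and the temp-file prefix are assumptions made for this example, and the source file is assumed to be saved as UTF-8.

```haskell
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Hypothetical helper: read the first line of a file, decoding it
-- with the given encoding instead of the locale default.
readLineWith :: TextEncoding -> FilePath -> IO String
readLineWith enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  line <- hGetLine h   -- hGetLine is strict, so closing afterwards is safe
  hClose h
  return line

-- Write the sample as UTF-8, then read it back with two different decoders.
-- Returns the decoded lengths as (UTF-8, Latin-1).
roundTrip :: IO (Int, Int)
roundTrip = do
  tmpDir <- getTemporaryDirectory
  (path, h) <- openTempFile tmpDir "omatase.txt"
  hSetEncoding h utf8
  hPutStrLn h "お待たせしました"
  hClose h
  asUtf8   <- readLineWith utf8 path
  asLatin1 <- readLineWith latin1 path
  removeFile path
  return (length asUtf8, length asLatin1)

main :: IO ()
main = roundTrip >>= print
```

Decoded as UTF-8 the line has 8 characters, matching the readFile session in the question; decoded as Latin-1 every byte becomes its own character, giving 24. A handle that ends up with the wrong encoding would show the same kind of mismatch.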
