How does GHC/Haskell decide what character encoding it's going to decode/encode from/to?


Question

It seems that GHC is at least inconsistent in the character encoding it decides to decode from.

Consider a file, omatase-shimashita.txt, with the following content, encoded in UTF-8: お待たせしました

readFile seems to read this in properly...

Prelude> content <- readFile "/home/chris/Desktop/omatase-shimashita.txt"
Prelude> length content
8
Prelude> putStrLn content
お待たせしました

However, if I write a simple "echo" server, it does not decode with a default of UTF-8. Consider the following code that handles an incoming client:

handleClient handle = do
  line <- hGetLine handle
  putStrLn $ "Read following line: " ++ toString line
  handleClient handle

And the relevant client code, explicitly sending UTF-8:

Data.ByteString.hPutStrLn handle $ Codec.Binary.UTF8.Generic.fromString "お待たせしました"

Is this not inconsistent behavior? Is there any method to this madness? I am planning to rewrite my application(s) to explicitly use ByteString objects and explicitly encode and decode using Codec.Binary.UTF8, but it would be good to know what's going on here anyway... :o/

UPDATE: I am running on Ubuntu Linux, version 10.10, with a locale of en_US.UTF-8...

$ cat /etc/default/locale 
LANG="en_US.UTF-8"
$ echo $LANG 
en_US.UTF-8

Solution

Which version of GHC are you using? Older versions in particular did not handle Unicode I/O very well.

This section in the GHC documentation describes how to change input/output encodings:

http://haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/System-IO.html#23

Also, the documentation says this:

A text-mode Handle has an associated TextEncoding, which is used to decode bytes into Unicode characters when reading, and encode Unicode characters into bytes when writing.

The default TextEncoding is the same as the default encoding on your system, which is also available as localeEncoding. (GHC note: on Windows, we currently do not support double-byte encodings; if the console's code page is unsupported, then localeEncoding will be latin1.)

Encoding and decoding errors are always detected and reported, except during lazy I/O (hGetContents, getContents, and readFile), where a decoding error merely results in termination of the character stream, as with other I/O errors.
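To see which encoding GHC actually picked on a given machine, the default described in the documentation quoted above can be inspected directly. This is a minimal sketch, assuming a GHC recent enough to ship `localeEncoding` and `hGetEncoding` in `System.IO`; the helper name `describeEncoding` is made up for illustration.

```haskell
import System.IO

-- Hypothetical helper: describe the encoding currently attached to a handle.
-- hGetEncoding returns Nothing when the handle is in binary mode.
describeEncoding :: Handle -> IO String
describeEncoding h = fmap (maybe "binary (no encoding)" show) (hGetEncoding h)

main :: IO ()
main = do
  -- The encoding GHC derived from the system locale.
  putStrLn ("localeEncoding: " ++ show localeEncoding)
  -- The encodings the standard handles are using right now.
  describeEncoding stdin  >>= putStrLn . ("stdin:  " ++)
  describeEncoding stdout >>= putStrLn . ("stdout: " ++)
```

Calling `describeEncoding` on the socket handle inside `handleClient` would show whether that handle is using the same encoding as the one `readFile` relied on.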

Maybe this has something to do with your problem? If GHC has defaulted to something other than UTF-8 somewhere, or your handle has been manually set to use a different encoding, that might explain the problem. If you're just trying to echo text at the console, then some kind of console code-page funniness is probably going on. I know I've had similar problems in the past with other languages, such as Python, when printing Unicode in a Windows console.

Try running hSetEncoding handle utf8 and see if it fixes your problem.
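The effect of `hSetEncoding` can be checked without a network connection. The sketch below is only an illustration under stated assumptions: it writes the sample string to a temporary file as UTF-8, then reads the same bytes back once as `utf8` and once as `latin1`. The names `readLineWith` and `demo` are hypothetical, and the `directory` package (which ships with GHC) is assumed to be available for the temporary-file handling.

```haskell
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Hypothetical helper: read the first line of a file, decoding its bytes
-- with the given encoding instead of the locale default.
readLineWith :: TextEncoding -> FilePath -> IO String
readLineWith enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc   -- override whatever localeEncoding chose
  line <- hGetLine h   -- hGetLine is strict, so closing afterwards is safe
  hClose h
  return line

-- Write the sample text as UTF-8, then decode the same bytes two ways and
-- return the resulting string lengths (UTF-8 decoding, Latin-1 decoding).
demo :: IO (Int, Int)
demo = do
  tmp <- getTemporaryDirectory
  (path, h) <- openTempFile tmp "omatase.txt"
  hSetEncoding h utf8
  hPutStrLn h "お待たせしました"
  hClose h
  asUtf8   <- readLineWith utf8 path
  asLatin1 <- readLineWith latin1 path
  removeFile path
  return (length asUtf8, length asLatin1)

main :: IO ()
main = demo >>= print   -- prints (8,24)
```

Decoded as UTF-8 the line has 8 characters, matching the `readFile` result in the question. Decoded as Latin-1 it has 24, one per byte, which is the kind of mismatch a handle with the wrong encoding produces. In the echo server, the equivalent step would be calling `hSetEncoding handle utf8` once on the client handle before the first `hGetLine`.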
