Data.ByteString.Lazy.Char8 newline conversion on Windows---is the documentation misleading?


Question


I have a question about the Data.ByteString.Lazy.Char8 module in the bytestring package. Specifically, my question concerns the readFile function, which is documented as follows:

Read an entire file lazily into a ByteString. Use 'text mode' on Windows to interpret newlines

I'm interested in the claim that this function will 'use text mode on Windows to interpret newlines'. The source code for the function is as follows:

-- | Read an entire file /lazily/ into a 'ByteString'. Use 'text mode'
-- on Windows to interpret newlines
readFile :: FilePath -> IO ByteString
readFile f = openFile f ReadMode >>= hGetContents

and we see that, in one sense, the claim in the documentation is perfectly true: the openFile function (as opposed to openBinaryFile) has been used, and so newline conversion will be enabled for the file.

But the handle is then passed to hGetContents. This calls Data.ByteString.hGetNonBlocking (see the source code), which is meant to be a non-blocking version of Data.ByteString.hGet (see the documentation); and, finally, Data.ByteString.hGet calls GHC.IO.Handle.hGetBuf (see the documentation or the source code). This function's documentation says that

hGetBuf ignores whatever TextEncoding the Handle is currently using, and reads bytes directly from the underlying IO device.

which suggests that the fact that we opened the file using openFile rather than openBinaryFile is irrelevant: the data will be read without transforming newlines, notwithstanding the claim in the documentation quoted at the beginning of the question.

So, the nub of the question: am I missing something? Is there some sense in which the statement that Data.ByteString.Lazy.Char8.readFile 'uses text mode on Windows to interpret newlines' is true? Or is the documentation just misleading?

P.S. Testing also indicates that this function, at least when used naively as I was using it, does no newline conversion on Windows.
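A test along these lines can be sketched as follows. This is only a minimal sketch, not the exact test used above: the scratch file name and the helper name roundTrip are made up for illustration. It writes raw bytes containing "\r\n" and reads them back through Data.ByteString.Lazy.Char8.readFile, so any newline conversion would show up as a difference.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL

-- Write raw bytes, then read them back through the lazy Char8 readFile.
-- B.writeFile stores the bytes as given, so the file really contains "\r\n".
-- (roundTrip is a hypothetical helper name, not part of bytestring.)
roundTrip :: FilePath -> B.ByteString -> IO BL.ByteString
roundTrip path raw = do
  B.writeFile path raw
  BL.readFile path

main :: IO ()
main = do
  let raw = B.pack "one\r\ntwo\r\n"
  got <- roundTrip "newline-test.txt" raw   -- hypothetical scratch file name
  print got
  putStrLn $ if BL.toStrict got == raw
               then "bytes unchanged: no newline conversion"
               else "bytes changed: newline conversion happened"
```

If the documentation were accurate, one would expect the second message on Windows; the behaviour described above corresponds to the first.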

Answer

FWIW, the package maintainer, Duncan Coutts, responded with some very helpful and enlightening remarks. I've asked for his permission to post them here, but in the interim here is a paraphrase.

The basic point is that the documentation was once correct, but now probably is not. In particular, when one opens a file on Windows, the operating system itself lets you open it in 'text' or 'binary' mode. The difference between openFile and openBinaryFile used to be that, on Win32, one would open the file in the OS's text mode and the other in its binary mode. (They would both do the same on POSIX.) Critically, if you opened a file in the OS's text mode, there was no way you could read from the file without newline conversion: it always happened.

When things were set up like this, the documentation referred to in the question was correct: Data.ByteString.Lazy.Char8.readFile would use System.IO.openFile; this would tell the OS to open the file in text mode, and newlines would be converted, even though hGetBuf was being used.

Later, Haskell's System.IO was souped up to make its handling of newlines more flexible. The specific aim was to let Haskell programs running on POSIX OSs, where the OS has no built-in newline mangling, still read files with Windows-style newlines; or, more accurately, to support Python-style 'universal' newline conversion on both kinds of OS. This meant that:

  1. The handling of newlines was brought into the Haskell libraries;
  2. Files are always opened in binary mode on Windows, whether you use openFile or openBinaryFile; and
  3. Instead, the choice between openFile and openBinaryFile affects whether System.IO's library code is set up to use nativeNewlineMode or noNewlineTranslation. The Haskell library code then does the appropriate newline conversion for you. You can now also choose to ask for universalNewlineMode.
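To make the three newline modes concrete, here is a small sketch using System.IO's hSetNewlineMode on a handle opened in text mode. The helper name readWithMode and the scratch file name are made up for illustration; the modes themselves (noNewlineTranslation, universalNewlineMode, nativeNewlineMode) are the ones exported by System.IO.

```haskell
import qualified Data.ByteString.Char8 as B
import System.IO

-- Read a whole file as a String with the given newline mode set on the handle.
-- (readWithMode is a hypothetical helper name.)
readWithMode :: NewlineMode -> FilePath -> IO String
readWithMode mode path =
  withFile path ReadMode $ \h -> do
    hSetNewlineMode h mode
    s <- hGetContents h
    -- Force the lazy String before withFile closes the handle.
    length s `seq` return s

main :: IO ()
main = do
  let path = "newline-modes.txt"            -- hypothetical scratch file name
  B.writeFile path (B.pack "one\r\ntwo\r\n")
  readWithMode noNewlineTranslation path >>= print
  readWithMode universalNewlineMode path >>= print
  readWithMode nativeNewlineMode    path >>= print
```

With noNewlineTranslation the "\r\n" pairs should come through untouched, and with universalNewlineMode they should be turned into "\n". What nativeNewlineMode does depends on the platform the program runs on.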

This was at about the same time as Haskell got proper encoding support built in to System.IO (rather than assuming latin-1 on input and simply truncating output Chars to their first 8 bits). Overall, it was a Good Thing.

But, critically, the new newline conversion, now built in to the libraries, never affects what hGetBuf does. Presumably the people building the new System.IO functionality thought that if one was reading the file in a binary way, any newline conversion interposing itself was probably not What the Programmer Wanted, i.e. was a mistake. And indeed, it probably is in 99% of cases: but in this case, it causes the problem above :-)
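This can be checked directly. The sketch below opens a file in text mode, explicitly asks for universalNewlineMode, and then reads with hGetBuf; the helper name rawRead and the scratch file name are made up for illustration. Since hGetBuf reads bytes straight from the underlying device, the "\r\n" pairs should still be there.

```haskell
import qualified Data.ByteString.Char8 as B
import Foreign.C.Types (CChar)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import System.IO

-- Read up to n raw bytes with hGetBuf from a text-mode handle on which
-- universal newline translation has been requested.
-- (rawRead is a hypothetical helper name.)
rawRead :: FilePath -> Int -> IO B.ByteString
rawRead path n =
  withFile path ReadMode $ \h -> do
    hSetNewlineMode h universalNewlineMode
    allocaBytes n $ \buf -> do
      got <- hGetBuf h (buf :: Ptr CChar) n
      -- Copy the bytes actually read into a ByteString.
      B.packCStringLen (buf, got)

main :: IO ()
main = do
  let path = "hgetbuf-test.txt"             -- hypothetical scratch file name
  B.writeFile path (B.pack "one\r\ntwo\r\n")
  rawRead path 64 >>= print
```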

Duncan says that the docs will probably change to reflect this brave new world in future releases of the library. In the interim, there is a workaround listed in another answer to this question.
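The workaround from the other answer is not reproduced here. Purely as an illustration of the kind of thing one can do, and not necessarily what that answer suggests, the conversion can be done by hand after reading: the sketch below turns each "\r\n" into "\n" in a lazy ByteString. The function name crlfToLf is made up.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL

-- Replace every "\r\n" with "\n". A '\r' that is not directly followed
-- by '\n' is left alone. (crlfToLf is a hypothetical helper name.)
crlfToLf :: BL.ByteString -> BL.ByteString
crlfToLf = BL.intercalate (BL.pack "\n") . fixAll . BL.split '\n'
  where
    -- Every piece except the last was followed by '\n' in the input.
    fixAll []     = []
    fixAll [l]    = [l]
    fixAll (l:ls) = stripCR l : fixAll ls
    stripCR l
      | not (BL.null l) && BL.last l == '\r' = BL.init l
      | otherwise                            = l

main :: IO ()
main = do
  -- Hypothetical usage: read the file as raw bytes, then normalise newlines.
  let sample = BL.pack "one\r\ntwo\r\nthree"
  BL.putStr (crlfToLf sample)
```

This works the same way on every platform, at the cost of looking at the end of each line once.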
