Text or ByteString


Question


Good day.

The one thing I now hate about Haskell is the number of packages for working with strings.

First I used native Haskell [Char] strings, but when I started using Hackage libraries I got completely lost in endless conversions. Every package seems to use a different string implementation, and some adopt their own handmade thing.

Next I rewrote my code with Data.Text strings and the OverloadedStrings extension. I chose Text because it has a wider set of functions, but it seems many projects prefer ByteString.
Could someone give a short explanation of why to use one or the other?

PS: By the way, how do I convert from Text to ByteString?

    Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString
           against inferred type Text
      Expected type: IO Data.ByteString.Lazy.Internal.ByteString
      Inferred type: IO Text

I tried encodeUtf8 from Data.Text.Encoding, but no luck:

    Couldn't match expected type Data.ByteString.Lazy.Internal.ByteString
           against inferred type Data.ByteString.Internal.ByteString

UPD:

Thanks for the responses. That *Chunks goodness looks like the way to go, but I am somewhat shocked by the result. My original function looked like this:

htmlToItems :: Text -> [Item]
htmlToItems =
    getItems . parseTags . convertFuzzy Discard "CP1251" "UTF8"

And now it has become:

htmlToItems :: Text -> [Item]
htmlToItems =
    getItems . parseTags . fromLazyBS . convertFuzzy Discard "CP1251" "UTF8" . toLazyBS
    where
      toLazyBS t = fromChunks [encodeUtf8 t]
      fromLazyBS t = decodeUtf8 $ intercalate "" $ toChunks t

And yes, this function does not work because it is wrong: if we supply Text to it, we are confident that the text is properly encoded and ready to use, so converting it is a stupid thing to do. But such a verbose conversion still has to take place somewhere outside htmlToItems.

Solution

ByteStrings are mainly useful for binary data, but they are also an efficient way to process text if all you need is the ASCII character set. If you need to handle Unicode strings, you need to use Text. However, I must emphasize that neither is a replacement for the other; they are generally used for different things. While Text represents pure Unicode, you still need to encode to and from a binary ByteString representation whenever you, for example, transport text via a socket or a file.
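
To make the encode/decode boundary concrete, here is a minimal sketch using the strict types from the text and bytestring packages. The helper names toWire and fromWire are hypothetical, chosen only for illustration, and UTF-8 is assumed as the wire encoding; a real protocol or file format may require something else.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- | Hypothetical helper: encode Text as UTF-8 bytes, e.g. just before
-- writing to a socket or a file.
toWire :: T.Text -> B.ByteString
toWire = encodeUtf8

-- | Hypothetical helper: decode UTF-8 bytes back into Text.
-- Returns Left with a description if the bytes are not valid UTF-8.
fromWire :: B.ByteString -> Either String T.Text
fromWire = either (Left . show) Right . decodeUtf8'

main :: IO ()
main = do
  let greeting = "Привет" :: T.Text
      bytes    = toWire greeting
  print (T.length greeting)                 -- 6 characters
  print (B.length bytes)                    -- 12 bytes in UTF-8
  print (fromWire bytes == Right greeting)  -- round trip succeeds
```

The character count and the byte count differ, which is the practical reason the two types are not interchangeable: Text is about characters, ByteString is about the bytes produced by a particular encoding.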

Here is a good article about the basics of Unicode, which does a decent job of explaining the relation between Unicode code points (Text) and the encoded binary bytes (ByteString): "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets"

You can use the Data.Text.Encoding module to convert between the two datatypes, or Data.Text.Lazy.Encoding if you are using the lazy variants (as you seem to be doing based on your error messages).
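
The second error in the question comes from mixing strict and lazy types: encodeUtf8 from Data.Text.Encoding yields a strict ByteString, while the caller expected a lazy one. The sketch below shows two ways to bridge that gap, assuming UTF-8 and a reasonably recent bytestring (one that provides fromStrict in Data.ByteString.Lazy). The function names are hypothetical.

```haskell
module Main where

import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- | Strict Text to lazy ByteString: encode strictly, then wrap the result.
textToLazyBS :: T.Text -> BL.ByteString
textToLazyBS = BL.fromStrict . TE.encodeUtf8

-- | The same conversion, going through lazy Text and the lazy encoder.
textToLazyBS' :: T.Text -> BL.ByteString
textToLazyBS' = TLE.encodeUtf8 . TL.fromStrict

-- | Lazy ByteString back to strict Text.
-- Note: decodeUtf8 throws an exception on invalid UTF-8 input.
lazyBSToText :: BL.ByteString -> T.Text
lazyBSToText = TL.toStrict . TLE.decodeUtf8

main :: IO ()
main = do
  let t = T.pack "hello, world"
  print (textToLazyBS t == textToLazyBS' t)   -- both routes agree
  print (lazyBSToText (textToLazyBS t) == t)  -- round trip succeeds
```

Either route replaces the hand-written fromChunks / toChunks helpers from the question; which one reads better depends on whether the surrounding code already works with lazy Text.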
