Haskell: convert Unicode sequence to UTF-8


Problem description



I am working on an HTTP client in Haskell (my first non-exercise project).

There is an API that returns JSON with all text written as Unicode escape sequences, something like

\u041e\u043d\u0430 \u043f\u0440\u0438\u0432\u0435\u0434\u0435\u0442 \u0432\u0430\u0441 \u0432 \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0441\u043f\u0438\u0441\u043e\u043a

I want to decode this JSON to UTF-8 so that I can print some data from the JSON message.

I searched for existing libraries but found nothing for this purpose.

So I wrote a function to convert the data (I am using lazy ByteStrings because that is the type I get from the wreq library):

ununicode :: BL.ByteString -> BL.ByteString 
ununicode s = replace s where

    replace :: BL.ByteString -> BL.ByteString
    replace str = case (Map.lookup (BL.take 6 str) table) of
              (Just x) -> BL.append x (replace $ BL.drop 6 str)
              (Nothing) -> BL.cons (BL.head str)  (replace $ BL.tail str)

    table = Map.fromList $ zip letters rus

    rus = ["Ё", "ё", "А", "Б", "В", "Г", "Д", "Е", "Ж", "З", "И", "Й", "К", "Л", "М",
           "Н", "О", "П", "Р", "С", "Т", "У", "Ф", "Х", "Ц", "Ч", "Ш", "Щ", "Ъ", "Ы",
           "Ь", "Э", "Ю", "Я", "а", "б", "в", "г", "д", "е", "ж", "з", "и", "й", "к",
           "л", "м", "н", "о", "п", "р", "с", "т", "у", "ф", "х", "ц", "ч", "ш", "щ",
           "ъ", "ы", "ь", "э", "ю", "я"]

    letters = ["\\u0401", "\\u0451", "\\u0410", "\\u0411", "\\u0412", "\\u0413",
               "\\u0414", "\\u0415", "\\u0416", "\\u0417", "\\u0418", "\\u0419",
               "\\u041a", "\\u041b", "\\u041c", "\\u041d", "\\u041e", "\\u041f",
               "\\u0420", "\\u0421", "\\u0422", "\\u0423", "\\u0424", "\\u0425",
               "\\u0426", "\\u0427", "\\u0428", "\\u0429", "\\u042a", "\\u042b",
               "\\u042c", "\\u042d", "\\u042e", "\\u042f", "\\u0430", "\\u0431",
               "\\u0432", "\\u0433", "\\u0434", "\\u0435", "\\u0436", "\\u0437",
               "\\u0438", "\\u0439", "\\u043a", "\\u043b", "\\u043c", "\\u043d",
               "\\u043e", "\\u043f", "\\u0440", "\\u0441", "\\u0442", "\\u0443",
               "\\u0444", "\\u0445", "\\u0446", "\\u0447", "\\u0448", "\\u0449",
               "\\u044a", "\\u044b", "\\u044c", "\\u044d", "\\u044e", "\\u044f"]

But it doesn't work as I expected. It replaces the text, but instead of Cyrillic letters I get something like 345 ?C1;8:C5< 8=B5@2LN A @4=52=8:>2F0<8 8=B5@5A=KE ?@>D5AA89 8 E>118

The second problem is that I can't debug my function. When I call it directly with a custom string, I get the error Data.ByteString.Lazy.head: empty ByteString. I have no idea why it is empty.

It works fine during normal program execution:
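On the head error: replace as posted has no case for empty input, so once the recursion has consumed the whole string, BL.head is applied to an empty ByteString. Below is a minimal sketch of a terminating variant, assuming the lookup table is passed in as an argument. The names replaceWith and demoTable are hypothetical and are not part of the original code.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Map as Map

-- Hypothetical variant of `replace`: same 6-byte lookup as in the question,
-- but BL.uncons supplies the missing base case for empty input.
replaceWith :: Map.Map BL.ByteString BL.ByteString -> BL.ByteString -> BL.ByteString
replaceWith table = go
  where
    go str = case BL.uncons str of
      Nothing -> BL.empty                         -- end of input: stop recursing
      Just (b, rest) ->
        case Map.lookup (BL.take 6 str) table of
          Just x  -> BL.append x (go (BL.drop 6 str))
          Nothing -> BL.cons b (go rest)

-- Tiny ASCII-only table for illustration (assumption: not the real table).
demoTable :: Map.Map BL.ByteString BL.ByteString
demoTable = Map.fromList [(BLC.pack "\\u0041", BLC.pack "A")]
```

For example, replaceWith demoTable applied to the bytes of x\u0041y walks past x, substitutes the escape, and stops cleanly after y.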

umailGet env params = do
    r <- apiGet env (("method", "umail.get"):params)
    x <- return $ case r of
          (Right a) -> a
          (Left a)  -> ""
    return $ ununicode $ x

and then in Main

  r2 <- umailGet client []
  print $  r2

The last problem is that the API can return any Unicode symbol, so this solution is bad by design.

Of course, the function's implementation is poor too, so after solving the main problem I am going to rewrite it using foldr.

UPDATED: It seems I did not describe the problem clearly enough.

So I am sending a request via the wreq library and getting a JSON answer. For example

{"result":"12","error":"\u041d\u0435\u0432\u0435\u0440\u043d\u044b\u0439 \u0438\u0434\u0435\u043d\u0442\u0438\u0444\u0438\u043a\u0430\u0442\u043e\u0440 \u0441\u0435\u0441\u0441\u0438\u0438"}

That is not Haskell's representation of the result; those are real ASCII symbols. I get the same text using curl or Firefox: 190 bytes / 190 ASCII symbols.

Using this site, for example, http://unicode.online-toolz.com/tools/text-unicode-entities-convertor.php I can convert it to Cyrillic text: {"result":"12","error":"Неверный идентификатор сессии"}

I need to implement something like this service in Haskell (or find a package where it has already been implemented), where a response like this has type lazy ByteString.

I also tried changing the types to use Text instead of ByteString (both lazy and strict), changing the first line to ununicode s = encodeUtf8 $ replace $ L.toStrict $ LE.decodeUtf8 s

With that new implementation I get an error when executing my program: Data.Text.Internal.Fusion.Common.head: Empty stream. So it looks like I have an error in my replacing function; maybe fixing it will also fix the main problem.
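A table of known letters cannot cover every code point the API might send. The following is a rough sketch of a general decoder for \uXXXX escapes, assuming the input is valid UTF-8 (which includes pure ASCII). The names unescape and decodeEscapes are made up for this example. It is not JSON-aware: an escaped backslash followed by u would be misread, and surrogate pairs are not combined (Data.Text replaces lone surrogates with U+FFFD). A JSON library such as aeson decodes these escapes while parsing string values, which is usually the more robust route.

```haskell
import Data.Char (chr, isHexDigit)
import Numeric (readHex)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Replace each backslash-u-plus-four-hex-digits escape with its code point.
-- Anything that does not match that exact shape is copied through unchanged.
unescape :: String -> String
unescape ('\\' : 'u' : a : b : c : d : rest)
  | all isHexDigit hex, [(n, "")] <- readHex hex = chr n : unescape rest
  where
    hex = [a, b, c, d]
unescape (x : rest) = x : unescape rest
unescape []         = []

-- Hypothetical replacement for `ununicode`: decode the bytes as UTF-8,
-- resolve the escapes on characters, then encode back to UTF-8 bytes.
decodeEscapes :: BL.ByteString -> BL.ByteString
decodeEscapes = TLE.encodeUtf8 . TL.pack . unescape . TL.unpack . TLE.decodeUtf8
```

Working on characters rather than bytes is the design choice that matters here: the output is produced by a real UTF-8 encoder instead of by string literals stored in a ByteString.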

Solution

I am not sure whether you are falling into the "print unicode" trap (see here). For encoding and decoding, Data.Text.Encoding on Hackage already provides decodeUtf8 :: ByteString -> Text and encodeUtf8 :: Text -> ByteString, which should do the task.
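To illustrate the "print unicode" trap in isolation: print goes through show, and show escapes every non-ASCII character as a decimal code point, whereas putStrLn from Data.Text.IO writes the characters themselves. A small sketch (the names sample, shown and printBoth are only for this example):

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Two Cyrillic letters, written with numeric escapes so the source stays ASCII.
sample :: T.Text
sample = T.pack "\1080\1076"

-- What `print sample` would write: the quoted, escaped form produced by `show`.
shown :: String
shown = show sample

-- Prints the escaped form first, then the real characters
-- (the second line depends on the terminal's locale and encoding).
printBoth :: IO ()
printBoth = do
  print sample
  TIO.putStrLn sample
```

So escaped output from print does not by itself mean the data is wrong; it may only be how show renders it.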

Edit:

I have played around with text/bytestring for some time to reproduce your "\u1234" characters, but I couldn't:

{-# LANGUAGE OverloadedStrings #-}

module Main where

import           Data.Text (Text)
import qualified Data.Text.Encoding as E
import qualified Data.Text.IO as T
import           Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B


inputB :: ByteString
inputB = "ДЕЖЗИЙКЛМНОПРСТУФ"

inputT :: Text
inputT = "ДЕЖЗИЙКЛМНОПРСТУФ"


main :: IO ()
main = do putStr "T.putStrLn inputT: "                ; T.putStrLn inputT
          putStr "B.putStrLn inputB: "                ; B.putStrLn inputB
          putStr "print inputB: "                     ; print inputB
          putStr "print inputT: "                     ; print inputT
          putStr "B.putStrLn $ E.encodeUtf8 inputT: " ; B.putStrLn $ E.encodeUtf8 inputT
          putStr "T.putStrLn $ E.decodeUtf8 inputB: " ; T.putStrLn $ E.decodeUtf8 inputB
          putStr "print $ E.decodeUtf8 inputB: "      ; print $ E.decodeUtf8 inputB
          putStr "print $ E.encodeUtf8 inputT: "      ; print $ E.encodeUtf8 inputT

Here is its output:

T.putStrLn inputT: ДЕЖЗИЙКЛМНОПРСТУФ
B.putStrLn inputB:
rint inputB: "\DC4\NAK\SYN\ETB\CAN\EM\SUB\ESC\FS\GS\RS\US !\"#$"
print inputT: "\1044\1045\1046\1047\1048\1049\1050\1051\1052\1053\1054\1055\1056\1057\1058\1059\1060"
B.putStrLn $ E.encodeUtf8 inputT: ДЕЖЗИЙКЛМНОПРСТУФ
T.putStrLn $ E.decodeUtf8 inputB:
rint $ E.decodeUtf8 inputB: "\DC4\NAK\SYN\ETB\CAN\EM\SUB\ESC\FS\GS\RS\US !\"#$"
print $ E.encodeUtf8 inputT: "\208\148\208\149\208\150\208\151\208\152\208\153\208\154\208\155\208\156\208\157\208\158\208\159\208\160\208\161\208\162\208\163\208\164"

Honestly, I don't know why I get the "rint" lines after the ByteString output lines that print nothing.
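The control characters in the print inputB output, and the digits and punctuation in the question's output, are both consistent with how string literals become ByteStrings: Data.ByteString.Char8.pack (and the IsString instance used by OverloadedStrings) keeps only the low 8 bits of each Char. Д is U+0414, so it becomes byte 0x14 (DC4); и is U+0438, so it becomes 0x38, the ASCII digit 8. A short sketch comparing that with real UTF-8 encoding (the names cyrillic, truncatedBytes and utf8Bytes are illustrative only):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- The letters U+0438 and U+043D, written with numeric escapes.
cyrillic :: String
cyrillic = "\1080\1085"

-- Char8.pack truncates each Char to 8 bits: 0x0438 -> 0x38, 0x043D -> 0x3D.
truncatedBytes :: [Word8]
truncatedBytes = B.unpack (BC.pack cyrillic)

-- Going through Text and encodeUtf8 yields two bytes per Cyrillic letter.
utf8Bytes :: [Word8]
utf8Bytes = B.unpack (TE.encodeUtf8 (T.pack cyrillic))
```

Here truncatedBytes is [0x38, 0x3D], which displays as 8=, a fragment that also appears in the question's garbled output, while utf8Bytes is [0xD0, 0xB8, 0xD0, 0xBD]. This suggests building the replacement values with encodeUtf8 rather than with ByteString literals.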
