haskell convert unicode sequence to utf 8
Problem description
I am working on an HTTP client in Haskell (it's my first "non-exercise" project).
There is an API which returns JSON with all text as Unicode escapes, something like
\u041e\u043d\u0430 \u043f\u0440\u0438\u0432\u0435\u0434\u0435\u0442 \u0432\u0430\u0441 \u0432 \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0441\u043f\u0438\u0441\u043e\u043a
I want to decode this JSON to UTF-8 so that I can print some data from the JSON message.
I searched for existing libraries, but found nothing for this purpose.
So I wrote a function to convert the data (I am using lazy bytestrings because that is the type I get from the wreq library)
ununicode :: BL.ByteString -> BL.ByteString
ununicode s = replace s where
  replace :: BL.ByteString -> BL.ByteString
  replace str = case (Map.lookup (BL.take 6 str) table) of
    (Just x) -> BL.append x (replace $ BL.drop 6 str)
    (Nothing) -> BL.cons (BL.head str) (replace $ BL.tail str)
  table = Map.fromList $ zip letters rus
  rus = ["Ё", "ё", "А", "Б", "В", "Г", "Д", "Е", "Ж", "З", "И", "Й", "К", "Л", "М",
         "Н", "О", "П", "Р", "С", "Т", "У", "Ф", "Х", "Ц", "Ч", "Ш", "Щ", "Ъ", "Ы",
         "Ь", "Э", "Ю", "Я", "а", "б", "в", "г", "д", "е", "ж", "з", "и", "й", "к",
         "л", "м", "н", "о", "п", "р", "с", "т", "у", "ф", "х", "ц", "ч", "ш", "щ",
         "ъ", "ы", "ь", "э", "ю", "я"]
  letters = ["\\u0401", "\\u0451", "\\u0410", "\\u0411", "\\u0412", "\\u0413",
             "\\u0414", "\\u0415", "\\u0416", "\\u0417", "\\u0418", "\\u0419",
             "\\u041a", "\\u041b", "\\u041c", "\\u041d", "\\u041e", "\\u041f",
             "\\u0420", "\\u0421", "\\u0422", "\\u0423", "\\u0424", "\\u0425",
             "\\u0426", "\\u0427", "\\u0428", "\\u0429", "\\u042a", "\\u042b",
             "\\u042c", "\\u042d", "\\u042e", "\\u042f", "\\u0430", "\\u0431",
             "\\u0432", "\\u0433", "\\u0434", "\\u0435", "\\u0436", "\\u0437",
             "\\u0438", "\\u0439", "\\u043a", "\\u043b", "\\u043c", "\\u043d",
             "\\u043e", "\\u043f", "\\u0440", "\\u0441", "\\u0442", "\\u0443",
             "\\u0444", "\\u0445", "\\u0446", "\\u0447", "\\u0448", "\\u0449",
             "\\u044a", "\\u044b", "\\u044c", "\\u044d", "\\u044e", "\\u044f"]
But it doesn't work as I expected. It replaces the text, but instead of Cyrillic letters I get something like
345 ?C1;8:C5< 8=B5@2LN A @4=52=8:>2F0<8 8=B5@5A=KE ?@>D5AA89 8 E>118
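The garbled output is consistent with each Cyrillic character in the rus literals being cut down to its low 8 bits: packing a String into a ByteString through Data.ByteString.Char8 (which is also what a ByteString string literal under OverloadedStrings does) truncates every Char to one byte instead of UTF-8 encoding it. A minimal sketch of the difference, where the names truncated and utf8 are only illustrative:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- 'Н' is U+041D (decimal 1053). Char8.pack keeps only the low byte, 0x1D.
truncated :: B.ByteString
truncated = BC.pack "\1053"

-- Going through Text and encodeUtf8 yields the two UTF-8 bytes 0xD0 0x9D.
utf8 :: B.ByteString
utf8 = E.encodeUtf8 (T.pack "\1053")
```

Under that reading, building the replacement values with encodeUtf8 rather than as plain ByteString literals would put real UTF-8 bytes into the table.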
The second problem is that I can't debug my function.
When I try to call it with a custom string, I get the error Data.ByteString.Lazy.head: empty ByteString
I have no idea why it's empty.
It works fine during normal program execution:
umailGet env params = do
  r <- apiGet env (("method", "umail.get"):params)
  x <- return $ case r of
    (Right a) -> a
    (Left a) -> ""
  return $ ununicode $ x
and then in Main
  r2 <- umailGet client []
  print $ r2
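The head: empty ByteString error mentioned above is what one would expect from a recursion with no case for empty input: once the remaining string is empty, the lookup fails and BL.head is applied to nothing. A minimal sketch of the same replacement loop with a base case follows; replaceWith and exampleTable are hypothetical names, and the table is passed in as a parameter only to keep the example short.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as BL
import qualified Data.Map as Map

-- Replace every 6-byte key found in the table; copy all other bytes through.
replaceWith :: Map.Map BL.ByteString BL.ByteString -> BL.ByteString -> BL.ByteString
replaceWith table = go
  where
    go str
      | BL.null str = BL.empty  -- base case: stop at the end of the input
      | otherwise =
          case Map.lookup (BL.take 6 str) table of
            Just x  -> BL.append x (go (BL.drop 6 str))
            Nothing -> BL.cons (BL.head str) (go (BL.tail str))

-- Illustrative table with a single ASCII entry.
exampleTable :: Map.Map BL.ByteString BL.ByteString
exampleTable = Map.fromList [("\\u0041", "A")]
```

With this table, replaceWith exampleTable "x\\u0041y" evaluates to "xAy", and an empty input returns an empty result instead of raising an exception.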
The last problem is that the API can return any Unicode symbol, so this solution is bad by design.
Of course, the function implementation seems bad too, so after solving the main problem I am going to rewrite it using foldr.
UPDATED: It seems I had not described the problem clearly enough.
I am sending a request via the wreq library and getting a JSON answer. For example
{"result":"12","error":"\u041d\u0435\u0432\u0435\u0440\u043d\u044b\u0439 \u0438\u0434\u0435\u043d\u0442\u0438\u0444\u0438\u043a\u0430\u0442\u043e\u0440 \u0441\u0435\u0441\u0441\u0438\u0438"}
That is not Haskell's representation of the result; those are real ASCII symbols. I get the same text using curl or Firefox. 190 bytes/190 ASCII symbols.
Using this site, for example, http://unicode.online-toolz.com/tools/text-unicode-entities-convertor.php I can convert it to Cyrillic text {"result":"12","error":"Неверный идентификатор сессии"}
I need to implement something like this service in Haskell (or find a package where it has already been implemented), where a response like this has the type lazy ByteString.
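A table-free version of that conversion can be sketched by parsing the four hex digits after each \u instead of looking them up. The sketch below assumes the input is valid UTF-8 and only handles escapes in the Basic Multilingual Plane; it does not combine surrogate pairs and does not understand JSON quoting, so an escaped backslash followed by u would be misread. The helper name unescape is hypothetical. A JSON library such as aeson decodes these escapes as part of parsing and is probably the more robust route.

```haskell
import Data.Char (chr, isHexDigit)
import Numeric (readHex)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as LE

-- Turn each \uXXXX escape into the character it names; copy everything else.
unescape :: String -> String
unescape ('\\' : 'u' : rest)
  | length hex == 4
  , all isHexDigit hex
  , [(n, "")] <- readHex hex
  = chr n : unescape rest'
  where
    (hex, rest') = splitAt 4 rest
unescape (c : cs) = c : unescape cs
unescape []       = []

-- Same type as the question's function: lazy bytes in, UTF-8 lazy bytes out.
ununicode :: BL.ByteString -> BL.ByteString
ununicode = LE.encodeUtf8 . TL.pack . unescape . TL.unpack . LE.decodeUtf8
```

Going through String is slow for large responses; it is used here only to keep the pattern matching easy to read.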
I also tried changing the types to use Text instead of ByteString (both lazy and strict), and changed the first line to ununicode s = encodeUtf8 $ replace $ L.toStrict $ LE.decodeUtf8 s
With that new implementation I get an error when executing my program
Data.Text.Internal.Fusion.Common.head: Empty stream
So it looks like I have an error in my replacing function; maybe fixing it will also fix the main problem.
Solution
I am not sure whether you are falling into the "print unicode" trap (see here). For encoding and decoding, Data.Text.Encoding on Hackage already provides decodeUtf8 :: ByteString -> Text
and encodeUtf8 :: Text -> ByteString
which should do the task.
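As a minimal sketch of those two functions on strict bytestrings (the names utf8Bytes, decoded, roundTrip and safeDecode are only illustrative):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- The UTF-8 encoding of the two characters U+0414 and U+0415.
utf8Bytes :: B.ByteString
utf8Bytes = B.pack [208, 148, 208, 149]

-- Bytes to Text; decodeUtf8 throws an exception on invalid input.
decoded :: T.Text
decoded = E.decodeUtf8 utf8Bytes

-- Text back to bytes; this reproduces the original byte sequence.
roundTrip :: B.ByteString
roundTrip = E.encodeUtf8 decoded

-- decodeUtf8' reports invalid input as a Left value instead of throwing.
safeDecode :: B.ByteString -> Maybe T.Text
safeDecode = either (const Nothing) Just . E.decodeUtf8'
```

Note that these functions convert between bytes and characters only; they do not interpret \uXXXX escape sequences that appear as ASCII text inside the data.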
Edit:
I have played around with text/bytestring for some time to reproduce your "\u1234" characters, but I couldn't.
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)
import qualified Data.Text.Encoding as E
import qualified Data.Text.IO as T
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B

inputB :: ByteString
inputB = "ДЕЖЗИЙКЛМНОПРСТУФ"

inputT :: Text
inputT = "ДЕЖЗИЙКЛМНОПРСТУФ"

main :: IO ()
main = do putStr "T.putStrLn inputT: " ; T.putStrLn inputT
          putStr "B.putStrLn inputB: " ; B.putStrLn inputB
          putStr "print inputB: " ; print inputB
          putStr "print inputT: " ; print inputT
          putStr "B.putStrLn $ E.encodeUtf8 inputT: " ; B.putStrLn $ E.encodeUtf8 inputT
          putStr "T.putStrLn $ E.decodeUtf8 inputB: " ; T.putStrLn $ E.decodeUtf8 inputB
          putStr "print $ E.decodeUtf8 inputB: " ; print $ E.decodeUtf8 inputB
          putStr "print $ E.encodeUtf8 inputT: " ; print $ E.encodeUtf8 inputT
Here is the result of it:
T.putStrLn inputT: ДЕЖЗИЙКЛМНОПРСТУФ
B.putStrLn inputB:
rint inputB: "\DC4\NAK\SYN\ETB\CAN\EM\SUB\ESC\FS\GS\RS\US !\"#$"
print inputT: "\1044\1045\1046\1047\1048\1049\1050\1051\1052\1053\1054\1055\1056\1057\1058\1059\1060"
B.putStrLn $ E.encodeUtf8 inputT: ДЕЖЗИЙКЛМНОПРСТУФ
T.putStrLn $ E.decodeUtf8 inputB:
rint $ E.decodeUtf8 inputB: "\DC4\NAK\SYN\ETB\CAN\EM\SUB\ESC\FS\GS\RS\US !\"#$"
print $ E.encodeUtf8 inputT: "\208\148\208\149\208\150\208\151\208\152\208\153\208\154\208\155\208\156\208\157\208\158\208\159\208\160\208\161\208\162\208\163\208\164"
Honestly, I don't know why I get the "rint" lines after the bytestring print lines that yield no result.
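Two separate effects seem to be visible in that output. First, print goes through show, which escapes every non-ASCII character as a decimal code, so the \1044 style output is only a rendering of the value and not its content. Second, inputB is a ByteString literal whose characters were truncated to single bytes (U+0414 became \DC4, i.e. 0x14), so writing it raw sends control bytes, including ESC, to the terminal; that is possibly what swallows the "p" of the following line, though this depends on the terminal. A small sketch of the first effect, with sample and shown as illustrative names:

```haskell
import qualified Data.Text as T

-- Two Cyrillic characters, U+0414 and U+0415.
sample :: T.Text
sample = T.pack "\1044\1045"

-- show adds quotes and escapes non-ASCII characters as decimal code points.
shown :: String
shown = show sample
```

Here shown is the 12-character string "\1044\1045" including the surrounding quotes and the backslashes, whereas Data.Text.IO.putStrLn sample writes the two characters themselves using the handle's encoding.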