How do I capture utf-8 decode errors in node.js?


Problem description


I just discovered that Node (tested: v0.8.23, current git: v0.11.3-pre) ignores any decoding errors in its Buffer handling, silently replacing any invalid utf-8 byte sequence with '\ufffd' (the Unicode REPLACEMENT CHARACTER) instead of throwing an exception about the non-utf8 input. As a consequence, fs.readFile, process.stdin.setEncoding and friends mask a large class of bad-input errors for you.

Example which doesn't fail but really ought to:

> notValidUTF8 = new Buffer([ 128 ], 'binary')
<Buffer 80>
> decodedAsUTF8 = notValidUTF8.toString('utf8') // no exception thrown here!
'�'
> decodedAsUTF8 === '\ufffd'
true

'\ufffd' is a perfectly valid character that can occur in legal utf8 (as the sequence ef bf bd), so it is non-trivial to monkey-patch in error handling based on this showing up in the result.
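To make the ambiguity concrete, here is a minimal sketch. It assumes a modern Node.js where `Buffer.from` is available (the versions named above predate it), and the variable names are only illustrative. An invalid byte and the legal three-byte encoding of U+FFFD decode to the same string, so the decoded result alone cannot tell them apart:

```javascript
// A lone continuation byte: not valid UTF-8.
const invalid = Buffer.from([0x80]);
// The legal UTF-8 encoding of U+FFFD itself.
const genuine = Buffer.from([0xef, 0xbf, 0xbd]);

const decodedInvalid = invalid.toString('utf8');
const decodedGenuine = genuine.toString('utf8');

console.log(decodedInvalid === '\ufffd');       // true
console.log(decodedGenuine === '\ufffd');       // true
console.log(decodedInvalid === decodedGenuine); // true
```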

Digging a little deeper, it looks like this stems from node simply deferring to v8's strings, which in turn behave as described above, since v8 itself never has to deal with an outside world full of foreign-encoded data.

Are there node modules or other means that let me catch utf-8 decode errors, preferably with context about where the error was discovered in the input string or buffer?

Solution

I hope you have solved the problem in the years since. I had a similar one and eventually solved it with this ugly trick:

  function isValidUTF8(buf){
   return Buffer.compare(new Buffer(buf.toString(),'utf8') , buf) === 0;
  }

It converts the buffer to a string and back, and checks that the result is byte-for-byte the same.

The 'utf8' encoding can be omitted.

Then we have:

> isValidUTF8(new Buffer('this is valid, 指事字 eè we hope','utf8'))
true
> isValidUTF8(new Buffer([128]))
false
> isValidUTF8(new Buffer('\ufffd'))
true

where the '\ufffd' character is correctly considered as valid utf8.

UPDATE: now this works in JXcore, too
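For readers on a recent Node.js, two other approaches may be worth a look. Neither is part of the original answer. The sketch below assumes a Node.js build with ICU support, which the `fatal` option of `TextDecoder` needs, and the helper names `decodeUTF8Strict` and `firstInvalidUTF8Offset` are hypothetical. The first helper asks `TextDecoder` to throw on malformed input rather than substitute U+FFFD. Since that error does not say where the bad byte sits, the second helper walks the bytes by hand, following the well-formed byte ranges of the Unicode standard, and reports the offset of the first ill-formed sequence:

```javascript
// TextDecoder is also a global in recent Node.js releases.
const { TextDecoder } = require('util');

// Decodes strictly: throws a TypeError instead of inserting U+FFFD.
function decodeUTF8Strict(buf) {
  return new TextDecoder('utf-8', { fatal: true }).decode(buf);
}

// Returns the offset of the first ill-formed sequence,
// or -1 if buf is well-formed UTF-8.
function firstInvalidUTF8Offset(buf) {
  let i = 0;
  while (i < buf.length) {
    const b = buf[i];
    let extra;     // number of continuation bytes expected
    let lo = 0x80; // allowed range for the first continuation byte
    let hi = 0xbf;
    if (b <= 0x7f) {
      i += 1;      // plain ASCII
      continue;
    } else if (b >= 0xc2 && b <= 0xdf) {
      extra = 1;
    } else if (b >= 0xe0 && b <= 0xef) {
      extra = 2;
      if (b === 0xe0) lo = 0xa0; // reject overlong forms
      if (b === 0xed) hi = 0x9f; // reject UTF-16 surrogates
    } else if (b >= 0xf0 && b <= 0xf4) {
      extra = 3;
      if (b === 0xf0) lo = 0x90; // reject overlong forms
      if (b === 0xf4) hi = 0x8f; // reject code points above U+10FFFF
    } else {
      return i;    // 0x80..0xc1 and 0xf5..0xff never start a sequence
    }
    for (let k = 1; k <= extra; k++) {
      const c = buf[i + k]; // undefined when the input is truncated
      const min = k === 1 ? lo : 0x80;
      const max = k === 1 ? hi : 0xbf;
      if (c === undefined || c < min || c > max) return i;
    }
    i += extra + 1;
  }
  return -1;
}

const bad = Buffer.concat([
  Buffer.from('ok ', 'utf8'),
  Buffer.from([0x80]),
  Buffer.from('!', 'utf8'),
]);

console.log(firstInvalidUTF8Offset(bad));                           // 3
console.log(firstInvalidUTF8Offset(Buffer.from('\ufffd', 'utf8'))); // -1

try {
  decodeUTF8Strict(bad);
} catch (err) {
  console.log(err instanceof TypeError);                            // true
}
```

The hand-written scanner is slower than the native decoder, so one reasonable pattern is to call `decodeUTF8Strict` first and run `firstInvalidUTF8Offset` only when it throws.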
