How do I capture utf-8 decode errors in node.js?

Question

I just discovered that Node (tested: v0.8.23, current git: v0.11.3-pre) ignores any decoding errors in its Buffer handling, silently replacing any non-utf8 bytes with '\ufffd' (the Unicode REPLACEMENT CHARACTER) instead of throwing an exception about the non-utf8 input. As a consequence, fs.readFile, process.stdin.setEncoding and friends mask a large class of bad input errors for you.

Example which doesn't fail but really ought to:

> notValidUTF8 = new Buffer([ 128 ], 'binary')
<Buffer 80>
> decodedAsUTF8 = notValidUTF8.toString('utf8') // no exception thrown here!
'�'
> decodedAsUTF8 === '\ufffd'
true

'\ufffd' is a perfectly valid character that can occur in legal utf8 (as the sequence ef bf bd), so it is non-trivial to monkey-patch in error handling based on this showing up in the result.
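To make the ambiguity concrete, here is a minimal sketch comparing a buffer that legitimately encodes U+FFFD with one that is not valid utf8 at all. The variable names are made up for illustration, and Buffer.from assumes a Node version much newer than the ones tested above.

```javascript
// Valid UTF-8: the three bytes ef bf bd encode U+FFFD itself.
const legitimate = Buffer.from([0xef, 0xbf, 0xbd]);
// Not valid UTF-8: a lone continuation byte.
const broken = Buffer.from([0x80]);

const fromLegitimate = legitimate.toString('utf8');
const fromBroken = broken.toString('utf8');

// Both decode to the same one-character string, so the decoded
// result alone cannot tell you whether the input was bad.
console.log(fromLegitimate === fromBroken); // true
```

So scanning the decoded string for '\ufffd' would flag legitimate input as well as broken input.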

Digging a little deeper, it looks like this stems from node just deferring to v8's strings, which in turn have the above behaviour, since v8 has no external world full of foreign-encoded data to deal with.

Are there node modules or other ways that let me catch utf-8 decode errors, preferably with context about where the error was discovered in the input string or buffer?

Recommended answer

From node 8.3 on, you can use util.TextDecoder to solve this cleanly:

const util = require('util')
// fatal: true makes decode() throw instead of substituting U+FFFD
const td = new util.TextDecoder('utf8', {fatal: true})
td.decode(Buffer.from('foo')) // works!
td.decode(Buffer.from([ 128 ])) // throws TypeError

This will also work in some browsers by using TextDecoder in the global namespace.
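The question also asked for context about where the error was found, which the TypeError from the fatal decoder does not obviously carry. One possible approach, sketched below, is to catch the error and then scan the bytes for the first ill-formed sequence. The helpers firstInvalidUtf8Offset and decodeUtf8Strict are hypothetical names invented for this sketch, not part of node, and the scanner is a hand-rolled check of the standard well-formed utf8 byte ranges. It assumes a Node version where TextDecoder is available as a global (11 or later).

```javascript
// Hypothetical helper: returns the byte offset where the first
// ill-formed UTF-8 sequence starts, or -1 if the buffer is well-formed.
function firstInvalidUtf8Offset(buf) {
  let i = 0;
  while (i < buf.length) {
    const lead = buf[i];
    if (lead <= 0x7f) { i += 1; continue; } // plain ASCII

    let need;              // number of continuation bytes expected
    let lo = 0x80;         // allowed range for the NEXT byte only
    let hi = 0xbf;
    if (lead >= 0xc2 && lead <= 0xdf) {
      need = 1;
    } else if (lead >= 0xe0 && lead <= 0xef) {
      need = 2;
      if (lead === 0xe0) lo = 0xa0; // rejects overlong encodings
      if (lead === 0xed) hi = 0x9f; // rejects UTF-16 surrogates
    } else if (lead >= 0xf0 && lead <= 0xf4) {
      need = 3;
      if (lead === 0xf0) lo = 0x90; // rejects overlong encodings
      if (lead === 0xf4) hi = 0x8f; // rejects values above U+10FFFF
    } else {
      return i; // stray continuation byte or byte that is never a lead
    }

    for (let k = 1; k <= need; k++) {
      const b = buf[i + k]; // undefined when the input is truncated
      if (b === undefined || b < lo || b > hi) return i;
      lo = 0x80; // later continuation bytes use the full range
      hi = 0xbf;
    }
    i += need + 1;
  }
  return -1;
}

// Hypothetical wrapper: strict decode that reports the offset on failure.
function decodeUtf8Strict(buf) {
  const decoder = new TextDecoder('utf-8', { fatal: true });
  try {
    return decoder.decode(buf);
  } catch (err) {
    const offset = firstInvalidUtf8Offset(buf);
    const wrapped = new Error('invalid UTF-8 at byte offset ' + offset);
    wrapped.offset = offset; // custom property, an assumption of this sketch
    wrapped.cause = err;     // keep the original TypeError around
    throw wrapped;
  }
}
```

With this, decodeUtf8Strict(Buffer.from('foo')) returns 'foo', while decodeUtf8Strict(Buffer.from([0x66, 0x80])) throws an Error whose offset property points at the offending byte. The offset is where the bad sequence starts, which for truncated multi-byte sequences is the lead byte rather than the end of the input.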
