Using JavaScript's atob to decode base64 doesn't properly decode UTF-8 strings

Published: 2021/12/27 15:14:49 · Tags: javascript, encoding, utf-8


Question

I'm using the JavaScript window.atob() function to decode a base64-encoded string (specifically, the base64-encoded content from the GitHub API). The problem is that I'm getting ASCII-encoded characters back (like â¢ instead of ™). How can I properly handle the incoming base64-encoded stream so that it's decoded as UTF-8?

Recommended answer

The Unicode Problem

Though JavaScript (ECMAScript) has matured, the fragility of Base64, ASCII, and Unicode encoding has caused a lot of headaches (many of them recorded in this question's history).

Consider the following example:

const ok = "a";
console.log(ok.codePointAt(0).toString(16)); //   61: fits in 1 byte

const notOK = "✓";
console.log(notOK.codePointAt(0).toString(16)); // 2713: needs more than 1 byte

console.log(btoa(ok));    // YQ==
console.log(btoa(notOK)); // throws InvalidCharacterError

Why do we encounter this?

Base64, by design, expects binary data as its input. In terms of JavaScript strings, this means strings in which each character occupies only one byte. So if you pass a string into btoa() containing characters that occupy more than one byte, you will get an error, because this is not considered binary data.

Source: MDN (2021)

The original MDN article also covered the broken nature of window.btoa and .atob, which have since been mended in modern ECMAScript. The original, now-dead MDN article explained:

The "Unicode Problem": Since DOMStrings are 16-bit-encoded strings, in most browsers calling window.btoa on a Unicode string will cause a Character Out Of Range exception if a character exceeds the range of an 8-bit byte (0x00~0xFF).
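The range rule in that quote can be checked before calling btoa. The following is a minimal sketch, not taken from MDN or the original answer; the helper name isBinaryString is hypothetical.

```javascript
// Hypothetical helper: returns true when every UTF-16 code unit of the
// string fits in one byte (0x00-0xFF), i.e. btoa() will accept it.
function isBinaryString(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

console.log(isBinaryString("a")); // true  (0x61)
console.log(isBinaryString("à")); // true  (0xE0 still fits in one byte)
console.log(isBinaryString("✓")); // false (0x2713)
```

Passing this check only means btoa will not throw. It does not mean the output is UTF-8: btoa("à") encodes the single byte 0xE0 and yields "4A==", not the two-byte UTF-8 form.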

Solution with binary interoperability

(Keep scrolling for the ASCII base64 solution.)

Source: MDN (2021)

The solution recommended by MDN is to actually encode to and from a binary string representation:

// convert a Unicode string to a string in which
// each 16-bit unit occupies only one byte
function toBinary(string) {
  const codeUnits = new Uint16Array(string.length);
  for (let i = 0; i < codeUnits.length; i++) {
    codeUnits[i] = string.charCodeAt(i);
  }
  return btoa(String.fromCharCode(...new Uint8Array(codeUnits.buffer)));
}

// a string that contains characters occupying > 1 byte
let encoded = toBinary("✓ à la mode") // "EycgAOAAIABsAGEAIABtAG8AZABlAA=="

Decoding binary ⇢ UTF-8

function fromBinary(encoded) {
  const binary = atob(encoded); // declare the variable instead of leaking a global
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return String.fromCharCode(...new Uint16Array(bytes.buffer));
}

// our previous Base64-encoded string
let decoded = fromBinary(encoded) // "✓ à la mode"

Where this falls a little short is that the encoded string EycgAOAAIABsAGEAIABtAG8AZABlAA== does not match the string 4pyTIMOgIGxhIG1vZGU= produced by the UTF-8 solution further down. This is because it is a binary-encoded string (the raw 16-bit code units), not a UTF-8-encoded string. If this doesn't matter to you (i.e., you aren't converting strings represented in UTF-8 from another system), then you're good to go. If, however, you want to preserve the UTF-8 functionality, you're better off using the solution described below.
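To see why the two base64 strings differ, it can help to look at the bytes that are actually being encoded. This is an illustrative sketch only: it assumes TextEncoder is available (modern browsers and recent Node.js), and it writes the UTF-16 bytes in little-endian order explicitly, whereas the Uint16Array approach above follows the platform's byte order.

```javascript
// UTF-8 bytes of "✓" (U+2713), which is what a UTF-8 based system would send.
const utf8Bytes = Array.from(new TextEncoder().encode("✓"));

// The single UTF-16 code unit of "✓", written out as little-endian bytes.
const view = new DataView(new ArrayBuffer(2));
view.setUint16(0, "✓".charCodeAt(0), true); // true = little-endian
const utf16Bytes = [view.getUint8(0), view.getUint8(1)];

// Format a byte array as space-separated two-digit hex.
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(toHex(utf8Bytes));  // e2 9c 93
console.log(toHex(utf16Bytes)); // 13 27
```

The bytes 13 27 are what the binary solution feeds into btoa (hence the leading "Eyc"), while e2 9c 93 are the bytes behind the leading "4pyT" of the UTF-8 string.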

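The entire history of this question shows just how many different ways we've had to work around broken encoding systems over the years. Though the original MDN article no longer exists, this solution is still arguably the better one: it solves "The Unicode Problem" while producing plain-text base64 strings that you can decode on, say, base64decode.org.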

There are two possible methods to solve this problem:

the first is to escape the whole string (with UTF-8, see encodeURIComponent) and then encode it;
the second is to convert the UTF-16 DOMString to a UTF-8 array of characters and then encode it.
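The second method in the list above can be sketched with the TextEncoder and TextDecoder APIs. This is not the code from the original answer; it assumes an environment where TextEncoder, TextDecoder, btoa, and atob are all available as globals (modern browsers and recent Node.js), and the function names are hypothetical.

```javascript
// Hypothetical helper: JavaScript string -> UTF-8 bytes -> base64.
function utf8ToBase64(str) {
  const bytes = new TextEncoder().encode(str);
  let binary = "";
  // Build the one-byte-per-character string that btoa() expects.
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

// Hypothetical helper: base64 -> UTF-8 bytes -> JavaScript string.
function base64ToUtf8(b64) {
  const binary = atob(b64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

console.log(utf8ToBase64("✓ à la mode"));          // 4pyTIMOgIGxhIG1vZGU=
console.log(base64ToUtf8("4pyTIMOgIGxhIG1vZGU=")); // ✓ à la mode
```

The loop is used instead of String.fromCharCode(...bytes) so that long inputs do not run into engine limits on the number of function arguments.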

A note on previous solutions: the MDN article originally suggested using unescape and escape to solve the Character Out Of Range exception problem, but they have since been deprecated. Some other answers here have suggested working around this with decodeURIComponent and encodeURIComponent; this has proven to be unreliable and unpredictable. The most recent update to this answer uses modern JavaScript functions to improve speed and modernize the code.

If you're trying to save yourself some time, you could also consider using a library:

js-base64 (NPM, great for Node.js)
base64-js

Encoding UTF8 ⇢ base64

    function b64EncodeUnicode(str) {
        // first we use encodeURIComponent to get percent-encoded UTF-8,
        // then we convert the percent encodings into raw bytes which
        // can be fed into btoa.
        return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
            function toSolidBytes(match, p1) {
                return String.fromCharCode('0x' + p1);
        }));
    }
    
    b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
    b64EncodeUnicode('\n'); // "Cg=="

Decoding base64 ⇢ UTF8

    function b64DecodeUnicode(str) {
        // Going backwards: from bytestream, to percent-encoding, to original string.
        return decodeURIComponent(atob(str).split('').map(function(c) {
            return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
        }).join(''));
    }
    
    b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
    b64DecodeUnicode('Cg=='); // "\n"

(Why do we need to do this? ('00' + c.charCodeAt(0).toString(16)).slice(-2) left-pads single-digit hex strings with a 0. For example, when c == "\n", c.charCodeAt(0).toString(16) returns "a", which the padding turns into "0a".)
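The padding step can be tried in isolation. In this small sketch the helper name toPercentByte is hypothetical; its body mirrors the expression used in b64DecodeUnicode.

```javascript
// Hypothetical helper: turn one binary-string character into a
// two-digit percent escape, as b64DecodeUnicode does for every character.
function toPercentByte(c) {
  return "%" + ("00" + c.charCodeAt(0).toString(16)).slice(-2);
}

console.log(toPercentByte("\n")); // %0a  (without padding this would be %a)
console.log(toPercentByte("â"));  // %e2  (already two digits, unchanged)

// The same padding written with String.prototype.padStart:
console.log("%" + "\n".charCodeAt(0).toString(16).padStart(2, "0")); // %0a
```

Without the leading zero, decodeURIComponent would receive an escape such as "%a", which is not a valid percent-encoding.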

Here's the same solution with some additional TypeScript compatibility (via @MA-Maddin):

// Encoding UTF8 ⇢ base64

function b64EncodeUnicode(str) {
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode(parseInt(p1, 16))
    }))
}

// Decoding base64 ⇢ UTF8

function b64DecodeUnicode(str) {
    return decodeURIComponent(Array.prototype.map.call(atob(str), function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2)
    }).join(''))
}

The first solution (deprecated)

This used escape and unescape (which are now deprecated, though this still works in all modern browsers):

function utf8_to_b64( str ) {
    return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
    return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

And one last thing: I first encountered this problem when calling the GitHub API. To get this to work properly on (Mobile) Safari, I actually had to strip all whitespace from the base64 source before I could even decode it. Whether or not this is still relevant in 2021, I don't know:

function b64_to_utf8( str ) {
    str = str.replace(/\s/g, ''); // strip all whitespace (\s), not the letter "s"
    return decodeURIComponent(escape(window.atob( str )));
}
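The same whitespace stripping can be combined with a decoder that does not rely on the deprecated escape function. This is a hedged sketch, assuming TextDecoder and atob are available as globals; the helper name decodeWrappedBase64 is hypothetical and is not part of the GitHub API or any library.

```javascript
// Hypothetical helper: decode base64 text that may be wrapped across lines.
function decodeWrappedBase64(b64) {
  const compact = b64.replace(/\s/g, ""); // \s matches spaces, tabs and newlines
  const binary = atob(compact);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

// The base64 string used earlier in this answer, split over several lines.
console.log(decodeWrappedBase64("4pyTIMOg\nIGxhIG1v\nZGU=\n")); // ✓ à la mode
```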
