Convert streamed buffers to a UTF-8 string


Question

I want to make an HTTP request using node.js to load some text from a web server. Since the response can contain a lot of text (several megabytes), I want to process each text chunk separately. I can achieve this using the following code:

var req = http.request(reqOptions, function(res) {
    ...
    res.setEncoding('utf8');
    res.on('data', function(textChunk) {
        // process utf8 text chunk
    });
});

This seems to work without problems. However, I also want to support HTTP compression, so I use zlib:

var zip = zlib.createUnzip();

// NO res.setEncoding('utf8') here since we need the raw bytes for zlib
res.on('data', function(chunk) {
    // do something like checking the number of bytes downloaded
    zip.write(chunk); // give the raw bytes to zlib (see below)
});

zip.on('data', function(chunk) {
    // convert chunk to utf8 text:
    var textChunk = chunk.toString('utf8');

    // process utf8 text chunk
});

This can be a problem for multi-byte characters like '\u00c4', which consists of two bytes in UTF-8: 0xC3 and 0x84. If the first byte ends up at the end of one chunk (Buffer) and the second byte at the start of the next, then chunk.toString('utf8') will produce incorrect characters at the end/beginning of the text chunks. How can I avoid this?
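The effect can be reproduced without any network or zlib involved. The following is a minimal sketch that simulates a chunk boundary falling in the middle of the character; the variable names are chosen for illustration only.

```javascript
// 'Ä' (U+00C4) is encoded as the two bytes 0xC3 0x84 in UTF-8.
const whole = Buffer.from('\u00c4', 'utf8');

// Simulate a chunk boundary that splits the character in half.
const firstChunk = whole.subarray(0, 1);  // <Buffer c3>
const secondChunk = whole.subarray(1);    // <Buffer 84>

// Decoding each chunk on its own cannot reassemble the character:
// each lone byte is turned into the replacement character U+FFFD.
const naive = firstChunk.toString('utf8') + secondChunk.toString('utf8');

console.log(naive === '\u00c4');       // false
console.log(naive === '\ufffd\ufffd'); // true
```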

Hint: I still need the raw buffers (more specifically, the number of bytes in each buffer) to limit the number of downloaded bytes. So calling res.setEncoding('utf8'), as in the first code example above for non-compressed data, does not suit my needs.

Recommended answer

Single Buffer

If you have a single Buffer you can use its toString method that will convert all or part of the binary contents to a string using a specific encoding. It defaults to utf8 if you don't provide a parameter, but I've explicitly set the encoding in this example.

var req = http.request(reqOptions, function(res) {
    ...

    res.on('data', function(chunk) {
        var textChunk = chunk.toString('utf8');
        // process utf8 text chunk
    });
});
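As a small illustration of the "all or part" wording above, toString also accepts optional start and end byte offsets. This is a sketch with a made-up sample string; note that the offsets count bytes, not characters.

```javascript
// 'héllo' is 5 characters but 6 bytes in UTF-8, because 'é' takes two bytes.
const buf = Buffer.from('h\u00e9llo', 'utf8');

// Convert the whole buffer.
const all = buf.toString('utf8');

// Convert only bytes 0..2 (the end offset 3 is exclusive): 'h' plus the two bytes of 'é'.
const firstTwoChars = buf.toString('utf8', 0, 3);

console.log(buf.length);    // 6
console.log(all);           // 'héllo'
console.log(firstTwoChars); // 'hé'
```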

Streamed Buffers

If you have streamed buffers, as in the question above, where the first byte of a multi-byte UTF-8 character may be contained in one Buffer (chunk) and the second byte in the next, you should use a StringDecoder:

var StringDecoder = require('string_decoder').StringDecoder;

var req = http.request(reqOptions, function(res) {
    ...
    var decoder = new StringDecoder('utf8');

    res.on('data', function(chunk) {
        var textChunk = decoder.write(chunk);
        // process utf8 text chunk
    });
});

This way, the bytes of incomplete characters are buffered by the StringDecoder until all required bytes have been written to the decoder.
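For the compressed case in the question, the decoder would presumably be fed from the zlib stream's 'data' event rather than from res, since res still carries compressed bytes. The sketch below shows the buffering behaviour in isolation. attachUtf8Decoder is a hypothetical helper name, and a plain EventEmitter stands in for the zlib stream so that the example runs on its own.

```javascript
const { StringDecoder } = require('string_decoder');
const { EventEmitter } = require('events');

// Hypothetical helper: decodes the Buffers emitted by `stream` and hands
// only complete UTF-8 text to `onText`.
function attachUtf8Decoder(stream, onText) {
  const decoder = new StringDecoder('utf8');
  stream.on('data', function (chunk) {
    // Incomplete trailing bytes are held back inside the decoder.
    const text = decoder.write(chunk);
    if (text.length > 0) onText(text);
  });
  stream.on('end', function () {
    // Flush whatever is still buffered when the stream ends.
    const rest = decoder.end();
    if (rest.length > 0) onText(rest);
  });
}

// Stand-in for the zlib stream (an assumption made for this demo only).
const fakeZip = new EventEmitter();
const pieces = [];
attachUtf8Decoder(fakeZip, function (text) { pieces.push(text); });

fakeZip.emit('data', Buffer.from([0x41, 0xc3])); // 'A' + first byte of 'Ä'
fakeZip.emit('data', Buffer.from([0x84, 0x42])); // second byte of 'Ä' + 'B'
fakeZip.emit('end');

console.log(pieces); // [ 'A', 'ÄB' ]
```

In the question's setup, the same helper could be attached to zip, while the existing res.on('data') handler keeps counting raw bytes and calling zip.write(chunk).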
