Splitting a string containing multi-byte characters into an array of strings

Question

I have this piece of code, which is intended to split a string into an array of strings, using CHUNK_SIZE as the size of each split in bytes (I'm doing this to paginate results). This works in most cases when characters are 1 byte, but when a multi-byte character (for example a 2-byte French character like é, or a 3- or 4-byte Chinese character) falls precisely at the split location, I end up with 2 unreadable characters: one at the end of my first array element and one at the start of the second.

Is there a way to fix the code to account for multi-byte characters, so that they are preserved in the final result?

public static ArrayList<String> splitFile(String data) throws Exception {
    ArrayList<String> messages = new ArrayList<>();
    int CHUNK_SIZE = 400000;// 0.75mb

    if (data.getBytes().length > CHUNK_SIZE) {
        byte[] buffer = new byte[CHUNK_SIZE];
        int start = 0, end = buffer.length;
        long remaining = data.getBytes().length;
        ByteArrayInputStream inputStream =
                new ByteArrayInputStream(data.getBytes());

        while ((inputStream.read(buffer, start, end)) != -1) {
            ByteArrayOutputStream outputStream =
                    new ByteArrayOutputStream();
            outputStream.write(buffer, start, end);
            messages.add(outputStream.toString("UTF-8"));
            remaining = remaining - end;

            if (remaining <= end) {
                end = (int) remaining;
            }
        }
        return messages;
    }

    messages.add(data);
    return messages;
}

Recommended answer

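import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public static List<String> splitFile(String data) throws IOException {
    List<String> messages = new ArrayList<>();
    final int CHUNK_SIZE = 400_000; // chunk size in bytes (about 0.4 MB)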

    byte[] dataBytes = data.getBytes(StandardCharsets.UTF_8);
    byte[] buffer = new byte[CHUNK_SIZE];
    int start = 0;
    final int end = CHUNK_SIZE;
    ByteArrayInputStream inputStream = new ByteArrayInputStream(dataBytes);

    for (; ; ) {
        int read = inputStream.read(buffer, start, end - start);
        if (read == -1) {
            if (start != 0) {
                messages.add(new String(buffer, 0, start,
                        StandardCharsets.UTF_8));
            }
            break;
        }
        // Check for half read multi-byte sequences:
        int fullEnd = start + read;
        while (fullEnd > 0) {
            byte b = buffer[fullEnd - 1];
            if (b >= 0) { // ASCII.
                break;
            }
            if ((b & 0xC0) == 0xC0) { // Start byte of sequence.
                --fullEnd;
                break;
            }
            --fullEnd;
        }
        messages.add(new String(buffer, 0, fullEnd, StandardCharsets.UTF_8));
        start += read - fullEnd;
        if (start > 0) { // Copy the bytes after fullEnd to the start.
            System.arraycopy(buffer, fullEnd, buffer, 0, start);
            //               src     srcI     dest    destI len
        }
    }
    return messages;
}

I have kept the ByteArrayInputStream, because most often one reads from an InputStream instead of having all the bytes in memory.

The chunk buffer is then filled starting at start rather than at 0, as some bytes from the prior chunk read might be left over.

Reading returns the number of bytes read, or -1 at the end of the stream.

If the chunk ends in an ASCII character, that is fine; otherwise I position the end at the start byte of the last multi-byte sequence. That sequence may or may not have been read completely; here I simply carry it over to the next chunk being read.
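The boundary check described above always carries the trailing multi-byte sequence over to the next chunk, even when it was read completely. As a rough sketch of a possible variation (the class name Utf8Boundary and the method safeEnd are hypothetical, not part of the answer), one could inspect the lead byte to see how many bytes the sequence needs, and only back up when it is actually cut off. This assumes the input is valid UTF-8.

```java
class Utf8Boundary {
    /**
     * Returns the largest end <= len such that buf[0, end) does not stop
     * in the middle of a UTF-8 sequence. Assumes buf holds valid UTF-8;
     * for malformed input it simply returns len.
     */
    static int safeEnd(byte[] buf, int len) {
        int i = len;
        // Step back over up to three continuation bytes (bit pattern 10xxxxxx).
        while (i > 0 && len - i < 3 && (buf[i - 1] & 0xC0) == 0x80) {
            i--;
        }
        if (i == 0) {
            return len; // empty input, or no lead byte found
        }
        int lead = buf[i - 1] & 0xFF;
        int expected; // total length of the sequence announced by the lead byte
        if (lead < 0x80) {
            return len; // ASCII byte, nothing is cut
        } else if ((lead & 0xE0) == 0xC0) {
            expected = 2;
        } else if ((lead & 0xF0) == 0xE0) {
            expected = 3;
        } else if ((lead & 0xF8) == 0xF0) {
            expected = 4;
        } else {
            return len; // not a valid lead byte; leave the data untouched
        }
        int have = len - (i - 1); // bytes of this sequence present in the buffer
        return have >= expected ? len : i - 1;
    }
}
```

In the loop of the answer, the result of such a helper could take the place of fullEnd; the bytes from that position up to the number of bytes read would still be copied to the front of the buffer as before.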

This code has not seen a compiler.

A List of messages is not memory friendly either.

BTW, one would have a similar problem on char[]: sometimes a Unicode code point (a symbol) is two (UTF-16) chars.
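To illustrate the char[] remark, here is a minimal sketch of splitting a String by a maximum number of chars without separating a surrogate pair. The class SurrogateSafeSplit and its split method are hypothetical names chosen for this example; the sizes here are counted in UTF-16 chars, not bytes.

```java
import java.util.ArrayList;
import java.util.List;

class SurrogateSafeSplit {
    /** Splits s into parts of at most maxChars chars, never cutting a surrogate pair. */
    static List<String> split(String s, int maxChars) {
        if (maxChars < 2) {
            // A surrogate pair needs two chars, so smaller parts cannot hold it.
            throw new IllegalArgumentException("maxChars must be at least 2");
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < s.length()) {
            int end = Math.min(start + maxChars, s.length());
            // If the cut would fall between a high and a low surrogate,
            // move it back by one so the pair goes into the next part.
            if (end < s.length()
                    && Character.isHighSurrogate(s.charAt(end - 1))
                    && Character.isLowSurrogate(s.charAt(end))) {
                end--;
            }
            parts.add(s.substring(start, end));
            start = end;
        }
        return parts;
    }
}
```

For example, splitting the string "a", U+1F600, "b" with maxChars set to 2 puts the emoji in a part of its own rather than leaving half of it at the end of the first part.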
