Is there any way to get the size in bytes of a string in Java?


Question

I need the size in bytes of each line in a file, so I can work out what percentage of the file has been read. I already have the size of the file from file.length(), but how do I get each line's size?

Recommended answer

You probably use something like the following to read the file:

FileInputStream fis = new FileInputStream(path);
BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
String line;
while ((line = br.readLine()) != null) {
   /* process line */
   /* report percentage */
}

You need to specify the encoding right at the beginning, when you create the reader. If you don't, you should get UTF-8 on Android: it is the default, although that can be changed. I would assume that no device does that, though.

To repeat what the other answers already stated: the character count is not always the same as the byte count. The UTF encodings are especially tricky. There are currently 249,764 assigned Unicode characters and potentially over a million (WP), and the UTF encodings use 1 to 4 bytes per character to be able to encode all of them. UTF-32 is the simplest case since it always uses 4 bytes. UTF-8 chooses the length dynamically and uses 1 to 4 bytes; simple ASCII characters use just 1 byte. (source: UTF & BOM FAQ)

To get the number of bytes you can use e.g. line.getBytes("UTF-8").length (length is a field of the returned array, not a method). One big disadvantage is that this is very inefficient, since it creates an encoded copy of the String's contents each time and throws it away afterwards. Avoiding that kind of allocation is point #1 addressed in Android | Performance Tips.
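As a small illustration of the difference between character count and byte count, here is a minimal sketch. The class name, the helper utf8Length and the sample string are made up for this example and are not part of the original answer.

```java
import java.nio.charset.StandardCharsets;

class ByteLengthDemo {
    /** Number of bytes the given text occupies when encoded as UTF-8. */
    static int utf8Length(String text) {
        // getBytes allocates a fresh byte[] on every call; only its length is used here.
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String line = "na\u00efve \u20ac";          // "naïve €" (hypothetical sample line)
        System.out.println(line.length());          // 7 chars
        System.out.println(utf8Length(line));       // 10 bytes: the ï takes 2, the € takes 3
    }
}
```

Using StandardCharsets.UTF_8 instead of the string "UTF-8" avoids the checked UnsupportedEncodingException; the byte count is the same either way.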

It is also not 100% accurate in terms of the actual bytes read from the file, for the following reasons:


  • UTF-16 text files, for example, often start with a special 2-byte BOM (Byte Order Mark) that signals whether they have to be interpreted as little or big endian. Those 2 bytes (UTF-8: 3, UTF-32: 4) are not reported when you just look at the String you get from your reader, so you are already a few bytes off here.

  • Turning every line of a file into UTF-16 bytes with getBytes will include those BOM bytes for each line, so getBytes will report 2 bytes too many for each line.

  • Line-ending characters are not part of the resulting line String. To make things worse, there are different ways of signaling the end of a line: usually the Unix-style '\n', which is only 1 character, or the Windows-style '\r\n', which is two characters. The BufferedReader simply skips those. Here your calculation is missing a quite variable number of bytes: from 1 byte for Unix/UTF-8 to 8 bytes for Windows/UTF-32.
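The per-line BOM and the dropped line terminators can both be observed directly. The following is a sketch only; the class and method names are hypothetical and the sample strings are invented for the demonstration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class LineOverheadDemo {
    /** Extra bytes getBytes produces for the BOM when the generic UTF-16 charset is used. */
    static int bomOverhead(String line) {
        int withBom = line.getBytes(StandardCharsets.UTF_16).length;      // BOM + big-endian data
        int withoutBom = line.getBytes(StandardCharsets.UTF_16BE).length; // data only
        return withBom - withoutBom;
    }

    /** Number of characters that readLine() dropped as line terminators. */
    static int droppedTerminatorChars(String text) {
        int kept = 0;
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                kept += line.length();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.length() - kept;
    }

    public static void main(String[] args) {
        System.out.println(bomOverhead("abc"));                      // 2
        System.out.println(droppedTerminatorChars("one\ntwo\r\n")); // 3 ('\n' plus "\r\n")
    }
}
```

Encoding with an explicit byte order (UTF_16BE or UTF_16LE) avoids the per-line BOM, but the terminator bytes still have to be estimated.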

The last two reasons would cancel each other out if you have Unix/UTF-16, but that is probably not the typical case. The effect of the error also depends on the line length: if you have an error of 4 bytes for each line, and each line is only 10 bytes long in total, your progress will be considerably wrong. If my math is right, your progress would be at 140% or 60% after the last line, depending on whether your calculation is off by +4 or -4 bytes per line.
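The arithmetic behind the 140% and 60% figures can be checked with a few lines. This is an illustrative sketch; the class name, the method finalPercent and the line count of 1000 are assumptions chosen for the example.

```java
class ProgressErrorDemo {
    /**
     * Progress in percent reported after the last line when every line
     * is misjudged by errorPerLine bytes (may be negative).
     */
    static long finalPercent(long lines, long actualBytesPerLine, long errorPerLine) {
        long fileLength = lines * actualBytesPerLine;
        long counted = lines * (actualBytesPerLine + errorPerLine);
        return counted * 100 / fileLength;
    }

    public static void main(String[] args) {
        System.out.println(finalPercent(1000, 10, 4));  // 140
        System.out.println(finalPercent(1000, 10, -4)); // 60
    }
}
```

The number of lines cancels out, so the relative error depends only on the ratio of the per-line error to the line length.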

So far, that means that regardless of what you do, you get no more than an approximation.

Getting the actual byte count could probably be done if you write your own special byte-counting Reader, but that would be quite a lot of work.

An alternative would be to use a custom InputStream that counts how many bytes are actually read from the underlying stream. That's not too hard to do, and it does not care about encodings.

The big disadvantage is that the count does not increase linearly with the lines you read, since BufferedReader fills its internal buffer and reads lines from there, then reads the next chunk from the file, and so on. If the buffer is large enough, you are at 100% at the first line already. But I assume your files are big enough, or you would not want to find out about the progress.

This, for example, would be such an implementation. It works, but I can't guarantee that it is perfect. It won't work if streams use mark() and reset(). File reading should not do that, though.

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Counts the bytes that are actually read from the wrapped stream. */
static class CountingInputStream extends FilterInputStream {
    private long bytesRead;

    protected CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int result = super.read();
        if (result != -1) bytesRead += 1;
        return result;
    }

    // read(byte[]) is intentionally not overridden: FilterInputStream
    // implements it by calling read(b, 0, b.length), which is counted below.
    // Overriding it as well would count those bytes twice.

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int result = super.read(b, off, len);
        if (result != -1) bytesRead += result;
        return result;
    }

    @Override
    public long skip(long n) throws IOException {
        long result = super.skip(n);
        // skip() never returns -1; only count bytes that were really skipped
        if (result > 0) bytesRead += result;
        return result;
    }

    public long getBytesRead() {
        return bytesRead;
    }
}

Using the following code:

File file = new File("mytestfile.txt");
int linesRead = 0;
long progress = 0;
long fileLength = file.length();
String line;

CountingInputStream cis = new CountingInputStream(new FileInputStream(file));
BufferedReader br = new BufferedReader(new InputStreamReader(cis, "UTF-8"), 8192);
while ((line = br.readLine()) != null) {
    long newProgress = cis.getBytesRead();
    if (progress != newProgress) {
        progress = newProgress;
        int percent = (int) ((progress * 100) / fileLength);
        System.out.println(String.format("At line: %4d, bytes: %6d = %3d%%", linesRead, progress, percent));
    }
    linesRead++;
}
System.out.println("Total lines: " + linesRead);
System.out.println("Total bytes: " + fileLength);
br.close();

I get output like:

At line:    0, bytes:   8192 =   5%
At line:   82, bytes:  16384 =  10%
At line:  178, bytes:  24576 =  15%
....
At line: 1621, bytes: 155648 =  97%
At line: 1687, bytes: 159805 = 100%
Total lines: 1756
Total bytes: 159805

or, for the same file encoded as UTF-16:

At line:    0, bytes:  24576 =   7%
At line:   82, bytes:  40960 =  12%
At line:  178, bytes:  57344 =  17%
.....
At line: 1529, bytes: 303104 =  94%
At line: 1621, bytes: 319488 =  99%
At line: 1687, bytes: 319612 = 100%
Total lines: 1756
Total bytes: 319612

Instead of printing that, you could update your progress display.
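One way to do that is to hand the byte count to a small helper that notifies a listener only when the whole-number percentage changes. This is a sketch under assumptions: ProgressReporter and its IntConsumer callback are hypothetical names, not part of the original answer or of any Android API.

```java
import java.util.function.IntConsumer;

class ProgressReporter {
    private final long totalBytes;
    private final IntConsumer listener; // hypothetical callback, e.g. something that updates a progress bar
    private int lastPercent = -1;

    ProgressReporter(long totalBytes, IntConsumer listener) {
        this.totalBytes = totalBytes;
        this.listener = listener;
    }

    /** Call with the running byte count; the listener fires only when the percentage changes. */
    void update(long bytesRead) {
        if (totalBytes <= 0) {
            return; // empty or unknown length: nothing meaningful to report
        }
        int percent = (int) (bytesRead * 100 / totalBytes);
        if (percent != lastPercent) {
            lastPercent = percent;
            listener.accept(percent);
        }
    }

    public static void main(String[] args) {
        ProgressReporter reporter = new ProgressReporter(200, p -> System.out.println(p + "%"));
        reporter.update(50);  // prints 25%
        reporter.update(51);  // still 25%, prints nothing
        reporter.update(200); // prints 100%
    }
}
```

Inside the read loop, the call would be reporter.update(cis.getBytesRead()) in place of the println.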

So, what is the best approach?


  • If you know that you have simple ASCII text in an encoding that uses only 1 byte for those characters: just use String#length() (and maybe add +1 or +2 for the line ending). String#length() is fast and simple, and as long as you know what files you have, you should have no problems.

  • If you have international text where the simple approach won't work:

    • For smaller files where processing each line takes rather long: use String#getBytes(). The longer processing one line takes, the lower the impact of the temporary arrays and their garbage collection. The inaccuracy should be within acceptable bounds. Just make sure not to crash if the progress ends up above or below 100% at the end.

    • For larger files: use the counting InputStream approach above. The larger the file, the better. Updating the progress in 0.001% steps just slows things down. Decreasing the reader's buffer size would increase the accuracy, but it also decreases the read performance.
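For the simple ASCII case, the estimate and the guard against values outside 0 to 100% might look like the sketch below. The class and method names are hypothetical, and the one-byte terminator assumes Unix line endings in a single-byte encoding.

```java
class AsciiProgressEstimate {
    /** Estimated size of one line in a single-byte encoding, including its terminator. */
    static long estimatedBytes(String line, int terminatorBytes) {
        return line.length() + terminatorBytes;
    }

    /** Percentage of the file consumed, forced into the range 0..100. */
    static int clampedPercent(long bytesSoFar, long fileLength) {
        if (fileLength <= 0) {
            return 100; // empty file: nothing left to read
        }
        long percent = bytesSoFar * 100 / fileLength;
        return (int) Math.max(0, Math.min(100, percent));
    }

    public static void main(String[] args) {
        long fileLength = 12; // size of "hello\nworld\n" in ASCII
        long soFar = 0;
        for (String line : new String[] {"hello", "world"}) {
            soFar += estimatedBytes(line, 1); // assume '\n' line endings
            System.out.println(clampedPercent(soFar, fileLength)); // 50, then 100
        }
    }
}
```

Clamping keeps a progress display valid even when the per-line estimate overshoots or undershoots the real file length.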
