Java substring broken encoding

Problem description

I read some data from a stream in UTF-8 encoding:

String line = new String(byteArray, "UTF-8");

Then I try to find a subsequence:

int startPos = line.indexOf(tag) + tag.length();
int endPos   = line.indexOf("/", startPos);

and cut it out:

String name = line.substring(startPos, endPos);

In most cases it works fine, but sometimes the result is broken. For example, for an input name like "гордунни" I got values like "горд��нни", "горду��ни", "г��рдунни", etc. It seems like surrogate pairs are randomly broken for some reason. This happened 4 times out of 1000.

How can I fix it? Do I need to use other String methods instead of indexOf() + substring(), or apply some encoding/decoding magic to my result?

Recommended answer

In order to get this out of the 'Unanswered' queue.

The problem occurs because the stream was read as chunks of bytes, sometimes splitting multi-byte UTF-8 characters.
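The effect can be reproduced without any stream at all. The following is a minimal sketch, assuming the original code decoded each chunk of bytes on its own (the question does not show the reading loop, so that is a guess). The class name and the helper decodePerChunk are made up for illustration, and the sample name from the question is written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SplitChunkDemo {

    // Hypothetical helper: decodes every fixed-size chunk independently,
    // which is the suspected pattern behind the broken names.
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int end = Math.min(off + chunkSize, data.length);
            byte[] chunk = Arrays.copyOfRange(data, off, end);
            // A chunk that starts or ends inside a multi-byte character is
            // malformed UTF-8; new String() substitutes U+FFFD for it.
            sb.append(new String(chunk, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The sample name from the question: 8 Cyrillic letters, 2 bytes each in UTF-8.
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        byte[] data = name.getBytes(StandardCharsets.UTF_8);

        // Chunk size 4: every boundary falls between two characters.
        System.out.println(decodePerChunk(data, 4).equals(name));
        // Chunk size 5: boundaries fall inside characters.
        System.out.println(decodePerChunk(data, 5).contains("\uFFFD"));
    }
}
```

Whether a character is hit depends only on where the chunk boundary happens to fall, which would explain why the damage shows up in only a few reads out of a thousand.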

By wrapping the InputStream in an InputStreamReader, you will read chunks of characters (as opposed to chunks of bytes), and multi-byte UTF-8 characters will survive.
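A minimal sketch of that approach follows. The tag "name=" and the helpers trickle, readAll and extractName are hypothetical, chosen only for illustration; trickle mimics a source that delivers at most a few bytes per read, so multi-byte characters are guaranteed to straddle read boundaries.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {

    // Hypothetical helper: hands out at most `max` bytes per read call,
    // imitating a stream that arrives in small chunks.
    static InputStream trickle(byte[] data, int max) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                return super.read(b, off, Math.min(len, max));
            }
        };
    }

    // Reads the whole stream as UTF-8 text. The reader keeps an incomplete
    // byte sequence until the remaining bytes arrive, so no character is cut.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Same indexOf() + substring() logic as in the question.
    // Assumes the tag and the "/" delimiter are both present in the line.
    static String extractName(String line, String tag) {
        int startPos = line.indexOf(tag) + tag.length();
        int endPos = line.indexOf("/", startPos);
        return line.substring(startPos, endPos);
    }

    public static void main(String[] args) {
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        byte[] data = ("name=" + name + "/rest").getBytes(StandardCharsets.UTF_8);

        String line = readAll(trickle(data, 3));
        System.out.println(extractName(line, "name=").equals(name));
    }
}
```

With the decoding done by the reader, indexOf() and substring() need no changes; they were never the source of the damage.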
