Java substring broken encoding
Problem description
I read some data from a stream in UTF-8 encoding:
String line = new String(byteArray, "UTF-8");
then try to find a subsequence:
int startPos = line.indexOf(tag) + tag.length();
int endPos = line.indexOf("/", startPos);
and cut it out:
String name = line.substring(startPos, endPos);
In most cases it works fine, but sometimes the result is broken. For example, for an input name like "гордунни" I got values like "горд��нни", "горду��ни", "г��рдунни", etc.
It seems like surrogate pairs are randomly broken for some reason. It happened 4 times out of 1000.
How can I fix it? Do I need to use other String methods instead of indexOf() + substring(), or apply some encoding/decoding magic to my result?
Recommended answer
Posting this to get the question out of the 'Unanswered' queue.
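The problem occurs because the stream was read as chunks of bytes, and a chunk boundary sometimes splits a multi-byte UTF-8 character. Each half is then decoded on its own as malformed input and becomes a replacement character (U+FFFD, displayed as �), which is why a single letter turns into ��. Surrogate pairs are not involved: each Cyrillic letter is a single UTF-16 char, so indexOf() and substring() are not at fault.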
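The failure can be reproduced without a real stream. The sketch below is only an illustration: the class name SplitChunkDemo, the helper decodeInChunks, and the chunk sizes are made up for this example and are not from the question. It decodes each byte chunk separately, which is assumed to be what the original read loop did.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SplitChunkDemo {
    // Hypothetical helper: decodes every chunk on its own, the way a
    // byte-chunked read loop would if it built a String per chunk.
    public static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int end = Math.min(off + chunkSize, data.length);
            byte[] chunk = Arrays.copyOfRange(data, off, end);
            sb.append(new String(chunk, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "гордунни" written with Unicode escapes so the source file encoding
        // does not matter; each letter takes 2 bytes in UTF-8 (16 bytes total).
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        byte[] data = name.getBytes(StandardCharsets.UTF_8);

        // Even chunk size: boundaries fall between letters, text survives.
        System.out.println(decodeInChunks(data, 4));
        // Odd chunk size: boundaries fall inside letters, and each orphaned
        // byte is decoded as U+FFFD.
        System.out.println(decodeInChunks(data, 5));
    }
}
```

With real streams the chunk boundary depends on how many bytes each read() happens to return, which would explain why the corruption only shows up occasionally.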
By wrapping the InputStream in an InputStreamReader created with the UTF-8 charset, you read chunks of characters (as opposed to chunks of bytes), and multi-byte UTF-8 characters survive intact.
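A minimal sketch of that approach follows. The class name ReaderDemo, the helper readAll, and the deliberately tiny buffer size are assumptions made for illustration; only InputStreamReader and the standard java.io classes come from the JDK. The reader keeps any incomplete byte sequence internally until the rest arrives, so the char buffer size has no effect on decoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {
    // Hypothetical helper: reads the whole stream as UTF-8 text,
    // chunkSize chars at a time.
    public static String readAll(InputStream in, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[chunkSize];
            int n;
            // read() returns the number of chars decoded, or -1 at end of stream.
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the example short.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "гордунни" as Unicode escapes, 2 bytes per letter in UTF-8.
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        byte[] data = name.getBytes(StandardCharsets.UTF_8);

        // A 3-char buffer forces several reads; the text still round-trips.
        String decoded = readAll(new ByteArrayInputStream(data), 3);
        System.out.println(decoded.equals(name));
    }
}
```

Once the text has been decoded this way, the indexOf() + substring() code from the question can be applied to the resulting String unchanged.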