Java Unicode byte parsing


Problem description


I'm in the process of reading some data from a file as a stream of bytes, and I've just encountered some Unicode strings that I'm not sure how best to handle.

Each character is using two bytes, with only the first seeming to contain actual data, so for example the string 'trust' is stored in the file as:

0x74 0x00(t) 0x72 0x00(r) ...and so on

Normally I'd just use a regex to replace the zeros with nothing and therefore remove the whitespace. However, the spaces between words within the file are implemented using 0x00 0x00, so trying to do a simple String 'replaceAll' is kind of messing it up a little.

I've tried playing around with the String encoding sets, such as 'ISO-8859-1' and 'UTF-8/16', but every time I end up with white space.

I did create a simple regex to remove the double zero hex values, which is:

new String(bytes).replaceAll("[\\00]{2,}", "");

But this obviously only works for the double zero, and I'd really like to replace single zeros with nothing, and double zeros with an actual ASCII/Unicode space character.

I could have sworn that one of the Java string format settings dealt with this kind of thing, but I might be wrong. So should I work on creating a regex to strip out the zeros, or does Java actually provide the mechanisms for doing it?

Thanks

Solution

That's "UTF-16LE". 0x00 0x00 actually encodes the NUL character in UTF-16 so that's what you will get.
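As a minimal sketch of what that means in practice, the bytes can be decoded with the UTF-16LE charset instead of being cleaned up with a regex. The class and method names here are hypothetical, and mapping each decoded NUL character to a space is an assumption based on the question's description of how this particular file separates words.

```java
import java.nio.charset.StandardCharsets;

final class Utf16LeDecode {

    // Decodes UTF-16LE bytes, then maps each NUL character (0x00 0x00 in the file)
    // to a space. Treating NUL as a word separator is an assumption about this file.
    static String decode(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_16LE);
        return text.replace('\u0000', ' ');
    }

    public static void main(String[] args) {
        byte[] bytes = {
            0x74, 0x00, 0x72, 0x00, 0x75, 0x00, 0x73, 0x00, 0x74, 0x00, // "trust"
            0x00, 0x00,                                                 // NUL separator
            0x6D, 0x00, 0x65, 0x00                                      // "me"
        };
        System.out.println(decode(bytes)); // prints "trust me"
    }
}
```

Passing the charset explicitly matters here: `new String(bytes)` uses the platform default charset, which would typically leave the zero bytes in the result as NUL characters between every letter.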

This encoding can encode about a million different characters, using 2 or 4 bytes per character. The first 256 characters are encoded with 0x00 as the second byte. If the text contains only those characters, that second byte can look useless, but it is required to represent the rest. For instance, the euro currency symbol (U+20AC) would show up as 0xAC 0x20.
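To make the byte layout concrete, the following sketch prints the UTF-16LE bytes for a Latin letter, the euro sign, and a character outside the Basic Multilingual Plane. The class name and the `hex` helper are hypothetical, introduced only for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf16LeBytes {

    // Formats the UTF-16LE encoding of a string as space-separated hex bytes.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_16LE)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("t"));            // 0x74 0x00
        System.out.println(hex("\u20AC"));       // euro sign: 0xAC 0x20
        System.out.println(hex("\uD83D\uDE00")); // U+1F600, a surrogate pair: 0x3D 0xD8 0x00 0xDE
    }
}
```

The last line shows the 4-byte case: a character above U+FFFF is stored as two 16-bit code units, each written low byte first.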
