How can I generate a non-UTF-8 character set?


Question

One of my requirements says "Text Box Name should accept only UTF-8 Character set". I want to perform a negative test by entering a non-UTF-8 character set. How can I do this?

Recommended answer

If you are asking how to construct a non-UTF-8 character, that should be easy from this definition from Wikipedia:

For code points U+0000 through U+007F, each code point is one byte long and looks like this:

0xxxxxxx   // a

For code points U+0080 through U+07FF, each code point is two bytes long and looks like this:

110xxxxx 10xxxxxx  // b

And so on.
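To make "and so on" concrete, here is a minimal sketch that classifies a single byte by its leading bits. The class name `Utf8ByteKind` and the method `kind` are made up for this illustration. The check looks at the bit pattern only: a few bytes that match a lead pattern (for example 0xC0 and 0xC1) are still never valid in UTF-8.

```java
public class Utf8ByteKind {
    /** Classifies one byte value (0..255) by its leading bits only. */
    public static String kind(int b) {
        if ((b & 0x80) == 0x00) return "single byte (0xxxxxxx)";   // pattern a
        if ((b & 0xC0) == 0x80) return "continuation (10xxxxxx)";  // second byte of pattern b
        if ((b & 0xE0) == 0xC0) return "lead of 2 (110xxxxx)";     // first byte of pattern b
        if ((b & 0xF0) == 0xE0) return "lead of 3 (1110xxxx)";
        if ((b & 0xF8) == 0xF0) return "lead of 4 (11110xxx)";
        return "never used (11111xxx)";
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x80, 0xC2, 0xE2, 0xF0, 0xFF};
        for (int b : samples) {
            System.out.printf("0x%02X %s%n", b, kind(b));
        }
    }
}
```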

So, to construct an illegal UTF-8 character that is one byte long, the highest bit must be 1 (to differ from pattern a) and the second-highest bit must be 0 (to differ from pattern b):

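10xxxxxx

or

111xxxxx

which also differs from both patterns. A 10xxxxxx byte is a continuation byte and cannot start a character; a 111xxxxx byte either starts a longer sequence or is never used, so on its own it is illegal too.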
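One way to confirm that such a byte is rejected is to decode it with a strict decoder. The sketch below uses the JDK's `CharsetDecoder` with `CodingErrorAction.REPORT`, which throws on malformed input instead of silently substituting a replacement character. The class and method names are hypothetical, and this only shows what Java's decoder accepts, not what the text box under test does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    /** Returns true only if the bytes decode as UTF-8 without any error. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Counts how many single byte values (0..255) are invalid on their own. */
    public static int countInvalidSingleBytes() {
        int invalid = 0;
        for (int i = 0; i <= 0xFF; i++) {
            if (!isValidUtf8(new byte[]{(byte) i})) invalid++;
        }
        return invalid;
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[]{(byte) 0x80})); // 10000000 -> false
        System.out.println(isValidUtf8(new byte[]{(byte) 0xE0})); // 11100000 -> false
        System.out.println(isValidUtf8(new byte[]{0x41}));        // 01000001 -> true
        System.out.println(countInvalidSingleBytes());            // 128 (0x80 through 0xFF)
    }
}
```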

With the same logic, you can construct illegal code unit sequences that are more than one byte long.
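As a rough illustration of longer illegal sequences, the sketch below builds a few by hand and checks them with a round trip: decode as UTF-8, encode again, and compare. Java replaces malformed input with U+FFFD when decoding this way, so an illegal sequence does not come back unchanged. The helper `roundTrips` and the chosen sample sequences are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class IllegalSequences {
    /** Decodes and re-encodes; malformed input comes back changed (as U+FFFD). */
    public static boolean roundTrips(int... unsignedBytes) {
        byte[] raw = new byte[unsignedBytes.length];
        for (int i = 0; i < raw.length; i++) raw[i] = (byte) unsignedBytes[i];
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(0xC2, 0x80));       // U+0080, well formed
        System.out.println(roundTrips(0xE2, 0x82, 0xAC)); // U+20AC, well formed
        System.out.println(roundTrips(0xC2));             // lead byte with nothing after it
        System.out.println(roundTrips(0xC2, 0x41));       // lead byte followed by a pattern-a byte
        System.out.println(roundTrips(0xE2, 0x82));       // three-byte sequence cut short
        System.out.println(roundTrips(0x80, 0x80));       // continuation bytes with no lead byte
        System.out.println(roundTrips(0xC0, 0x80));       // fits pattern b, but overlong form of U+0000
    }
}
```

The first two calls should print true and the remaining five false. For a negative test, any of the failing sequences could serve as raw input, provided the test tool sends the bytes unmodified.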

You did not tag a language, but I had to test it, so I used Java:

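import java.nio.charset.StandardCharsets;

public class SingleByteTable {
    public static void main(String[] args) {
        // Decode each single byte value 0..254 as UTF-8 and print one row per value.
        for (int i = 0; i < 255; i++) {
            System.out.println(
                i + " " +                        // unsigned value
                (byte) i + " " +                 // same bits as a signed Java byte
                Integer.toHexString(i) + " " +   // hexadecimal
                String.format("%8s", Integer.toBinaryString(i)).replace(' ', '0') + " " + // 8-bit binary
                // StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException of "UTF-8"
                new String(new byte[]{(byte) i}, StandardCharsets.UTF_8)
            );
        }
    }
}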

0 to 31 are non-printable control characters, 32 is the space, and printable characters follow:

...
31 31 1f 00011111 
32 32 20 00100000  
33 33 21 00100001 !
...
126 126 7e 01111110 ~
127 127 7f 01111111 
128 -128 80 10000000 �

Delete (DEL) is 0x7F. After it, from 128 up to and including 254, no valid character is printed: each of these bytes is malformed UTF-8 on its own, so Java decodes it to the replacement character U+FFFD (�). You can also see this from the UTF-8 chart table:

Code point U+007F is represented with one byte, 0x7F (bits 01111111), while code point U+0080 is represented with two bytes, 0xC2 0x80 (bits 11000010 10000000).
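The jump from one byte to two at this boundary can be checked directly by encoding the code points. This is a small sketch; the class name `BoundaryEncoding` and the helper `utf8Hex` are invented for the example, and the two extra code points (U+07FF and U+0800) are added only to show the next boundary.

```java
import java.nio.charset.StandardCharsets;

public class BoundaryEncoding {
    /** Encodes one code point as UTF-8 and returns its bytes as space-separated hex. */
    public static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to print the unsigned value
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0x007F)); // 7F
        System.out.println(utf8Hex(0x0080)); // C2 80
        System.out.println(utf8Hex(0x07FF)); // DF BF
        System.out.println(utf8Hex(0x0800)); // E0 A0 80
    }
}
```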

If you are not familiar with UTF-8, I strongly recommend reading this excellent article:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
