您将如何创建所有UTF-8字符的字符串? [英] How would you create a string of all UTF-8 characters?

查看:159
本文介绍了您将如何创建所有UTF-8字符的字符串?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有很多方法可以表示+1百万个 UTF-8字符.取带有大写字母(Ā)的拉丁大写字母"A".这是Unicode代码点U+0100,十六进制数0xc4 0x80,十进制数196 128和二进制11000100 10000000.

There are many ways to represent the +1 million UTF-8 characters. Take the latin capital "A" with macron (Ā). This is unicode code point U+0100, hex number 0xc4 0x80, decimal number 196 128, and binary 11000100 10000000.

我想创建用于测试应用程序的前65,535个UTF-8字符的集合.这些都是直到代码点U+FFFF(byte3)的unicode字符.

I would like to create a collection of the first 65,535 UTF-8 characters for use in testing applications. These are all unicode characters up to code point U+FFFF (byte3).

是否可以执行类似for($x=0)循环的操作,然后将结果十进制转换为另一个基数(例如十六进制),从而允许创建匹配的unicode字符?

Is it possible to do something like a for($x=0) loop and then convert the resulting decimal to another base (like hex) which would allow the creation of the matching unicode character?

我可以使用以下方式创建值Ā:

I can create the value Ā using something like this:

$char = "\xc4\x80";
// or
$char = chr(196).chr(128);

但是,我不确定如何将其转变为自动化过程.

However, I am not sure how to turn this into an automated process.

// fail!
$char = "\x". dechex($a). "\x". dexhex($b);

推荐答案

您可以利用iconv(或其他一些函数)将代码点号转换为UTF-8字符串:

You can leverage iconv (or a few other functions) to convert a code point number to a UTF-8 string:

function unichr($i)
{
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

$codeunits = array();
for ($i = 0; $i<0xD800; $i++)
    $codeunits[] = unichr($i);
for ($i = 0xE000; $i<0xFFFF; $i++)
    $codeunits[] = unichr($i);
$all = implode($codeunits);

(我避免使用替代范围0xD800–0xDFFF,因为它们本身不能有效地放入UTF-8;那应该是"CESU-8".)

(I avoided the surrogate range 0xD800–0xDFFF as they aren't valid to put in UTF-8 themselves; that would be "CESU-8".)

这篇关于您将如何创建所有UTF-8字符的字符串?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆