More elegant UTF-8 encoder


Problem description





Hi,

For a free software project, I had to write a routine that, given a
Unicode scalar value U+0000 - U+10FFFF, returns an integer that holds
the UTF-8 encoded form of it, for example, U+00F6 becomes 0x0000C3B6.
I came up with the following. I am looking for a more elegant solution,
that is, roughly, faster, shorter, more readable, ... while producing
the same output for the cited range.

unsigned int
utf8toint(unsigned int c) {
  unsigned int len, res, i;

  if (c < 0x80) return c;

  len = c < 0x800 ? 1 : c < 0x10000 ? 2 : 3;

  /* this could be replaced with an array lookup */
  res = (2u << len) - 1 << (7 - len) << len * 8;

  for (i = len; i > 0; --i, c >>= 6)
    res |= ((c & 0x3f) | 0x80) << (len - i) * 8;

  /* while unusual, the desired result is an int */
  return res | c << len * 8;
}

Any ideas? Thanks,
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Solution

On Jun 10, 2:42 pm, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:


What you are trying to do seems rather bizarre. If you want to encode
Unicode in a 32 bit number, leave it unchanged. If you want to encode
Unicode as a sequence of bytes, store it into a sequence of bytes.
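
The byte-sequence approach suggested here could look roughly like the sketch below. The name `utf8_encode`, the caller-supplied buffer of at least four bytes, and the return convention (number of bytes written, or 0 for a value that is not a Unicode scalar value) are all assumptions made for illustration and do not come from the thread.

```c
#include <stddef.h>

/* Hypothetical helper, not from the thread: writes the UTF-8 form of c
   into out (which must have room for 4 bytes) in stream order and
   returns the number of bytes written, or 0 if c is a surrogate or
   lies above U+10FFFF. */
size_t utf8_encode(unsigned long c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        if (c >= 0xD800 && c <= 0xDFFF)
            return 0; /* surrogates are not scalar values */
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;
}
```

With this shape the caller never has to think about integer width or byte order: for U+00F6 the buffer receives 0xC3 followed by 0xB6, and the return value is 2.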

And I would absolutely refuse reviewing code containing an expression
like "res | c << len * 8" without parentheses.
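
For readers following along: in C, `*` binds tighter than binary `-`, which binds tighter than `<<`, which in turn binds tighter than `|`. The two unparenthesized expressions in the posted routine therefore group as spelled out below. The helper names `combine` and `lead_marker` are made up for this illustration.

```c
/* Illustrative helpers only; the names are not from the thread. */

/* "res | c << len * 8" groups as res | (c << (len * 8)). */
unsigned int combine(unsigned int res, unsigned int c, unsigned int len)
{
    return res | (c << (len * 8));
}

/* "(2 << len) - 1 << (7 - len) << len * 8" groups as
   (((2 << len) - 1) << (7 - len)) << (len * 8).
   Unsigned constants are used here so the shift cannot overflow
   a signed int when len is 3. */
unsigned int lead_marker(unsigned int len)
{
    return (((2u << len) - 1u) << (7 - len)) << (len * 8);
}
```

For len values 1, 2 and 3, `lead_marker` yields the lead-byte prefixes 0xC0, 0xE0 and 0xF0 shifted into the top byte of the result, assuming an `unsigned int` of at least 32 bits.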


In article <m9********************************@hive.bjoern.hoehrmann.de>,
Bjoern Hoehrmann <bj****@hoehrmann.de> wrote:

> I am looking for a more elegant solution,
> that is, roughly, faster, shorter, more readable

"Choose any two".

To be honest, I don't see the point. It looks fast enough: after all,
you must be reading the data from somewhere, which is likely to be
much slower. Unless you have profiling data showing that it's a
significant overhead, forget it. As for clearer, it depends where
you're starting from. If you want to match a typical textual
description of UTF-8, I think something like this is much clearer:

unsigned char b[4] = {0, 0, 0, 0};

if(c < 0x80)
    b[0] = c;
else if(c < 0x800)
{
    b[1] = 0xc0 + (c >> 6);
    b[0] = 0x80 + (c & 0x3f);
}
else if(c < 0x10000)
{
    b[2] = 0xe0 + (c >> 12);
    b[1] = 0x80 + ((c >> 6) & 0x3f);
    b[0] = 0x80 + (c & 0x3f);
}
else
{
    b[3] = 0xf0 + (c >> 18);
    b[2] = 0x80 + ((c >> 12) & 0x3f);
    b[1] = 0x80 + ((c >> 6) & 0x3f);
    b[0] = 0x80 + (c & 0x3f);
}

/* cast so the top byte is shifted as unsigned rather than as int */
return b[0] + (b[1] << 8) + (b[2] << 16) + ((unsigned int)b[3] << 24);

That's untested and derived from code intended to output bytes in
sequence. Of course you could replace the array assignments with
returns of expressions composing the parts.
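
One possible reading of that last suggestion is sketched below, with each branch returning the composed value directly instead of filling an array. The function name `utf8_pack` and the use of `uint32_t` are assumptions for illustration. Like the array version, this sketch does not reject surrogates or values above U+10FFFF.

```c
#include <stdint.h>

/* Hypothetical variant, not from the thread: packs the UTF-8 bytes of c
   into one 32-bit value, lead byte in the most significant used byte. */
uint32_t utf8_pack(uint32_t c)
{
    if (c < 0x80)
        return c;
    if (c < 0x800)
        return ((0xC0 + (c >> 6)) << 8)
             | (0x80 + (c & 0x3F));
    if (c < 0x10000)
        return ((0xE0 + (c >> 12)) << 16)
             | ((0x80 + ((c >> 6) & 0x3F)) << 8)
             | (0x80 + (c & 0x3F));
    return ((0xF0 + (c >> 18)) << 24)
         | ((0x80 + ((c >> 12) & 0x3F)) << 16)
         | ((0x80 + ((c >> 6) & 0x3F)) << 8)
         | (0x80 + (c & 0x3F));
}
```

Because every operand involving `c` is already an unsigned 32-bit value, the shifts here need no extra casts. For the example from the question, `utf8_pack(0xF6)` gives 0xC3B6.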

-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.


On Jun 10, 6:42 am, Bjoern Hoehrmann <bjo...@hoehrmann.de> wrote:


I'd have a look at (or just use) the free code to do such conversions
which is available on the Unicode web site. That does the obvious
thing of creating an array of bytes holding the UTF-8 encoding, but
you could easily convert that result or modify the code. You seem to
have a bizarre requirement though - what if the UTF-8 encoding
requires more bytes than any available integer type?
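
On the size question: for the range cited in the original post, U+0000 to U+10FFFF, the UTF-8 form is at most four bytes, so it fits in any unsigned type of 32 bits or more. Plain `unsigned int` is only guaranteed to have 16 bits, which is why the sketch below takes a `uint32_t`. The name `utf8_length` is an assumption for illustration.

```c
#include <stdint.h>

/* Hypothetical helper, not from the thread: number of bytes in the
   UTF-8 form of c, or 0 if c lies above U+10FFFF. The result never
   exceeds 4, so the packed encoding always fits in 32 bits. */
int utf8_length(uint32_t c)
{
    if (c < 0x80)
        return 1;
    if (c < 0x800)
        return 2;
    if (c < 0x10000)
        return 3;
    if (c <= 0x10FFFF)
        return 4;
    return 0;
}
```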

