UTF-8 in char*


Question




Hi,

I am developing a vCard application which have to support UTF-8. Does the
UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
treat as NULL character in strlen?

Jacky

Recommended answers

Jacky Cheung <ja*****@yahoo.com> scribbled the following:

> Hi, I am developing a vCard application which have to support UTF-8. Does the
> UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
> treat as NULL character in strlen?





AFAIK UTF-8 does not have NUL characters (most people prefer to spell it
that way to avoid confusion). UTF-8 only includes "normal" ASCII
characters and special characters with bit 7 set. You're in no more
danger of seeing NUL in UTF-8 than you are of seeing it in ASCII.
Note that vCards, by themselves, are completely off-topic here.

--
/-- Joona Palaste (pa*****@cc.helsinki.fi) ------------- Finland --------\
\-- http://www.helsinki.fi/~palaste --------------------- rules! --------/
"He said: 'I'm not Elvis'. Who else but Elvis could have said that?"
- ALF
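
As an illustration of the point above, here is a minimal C sketch, assuming the text is valid UTF-8 and does not itself contain the NUL character. The helper names has_no_zero_byte and check_sample and the sample word are made up for this example. In UTF-8 every byte of a multi-byte sequence has bit 7 set, so none of them can be zero, and strlen() scans to the terminator as usual.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if none of the first n bytes of s is zero. */
int has_no_zero_byte(const char *s, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (s[i] == '\0')
            return 0;
    return 1;
}

/* "naive" written with an i-diaeresis (U+00EF), which UTF-8 encodes
   as the two bytes 0xC3 0xAF. */
const char sample[] = "na\xC3\xAFve";

/* Checks that strlen() sees all six bytes of the sample text. */
void check_sample(void)
{
    assert(sizeof sample == 7);           /* six bytes of text plus the terminator */
    assert(has_no_zero_byte(sample, 6));  /* no zero byte inside the text */
    assert(strlen(sample) == 6);          /* bytes, not the five characters */
}
```

Note that the result is a byte count: the sample has five characters but strlen() reports six.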


Jacky Cheung wrote:
> I am developing a vCard application which have to support UTF-8. Does
> the UTF-8 in char* will crash the strlen, I mean does UTF-8 have some
> char which treat as NULL character in strlen?





Well, it has a null control character, but it means more or less the
same as the ASCII null character. So if you just want to handle normal
text, you can use normal C strings, and thus strlen().

BTW, if you have only written programs for ASCII before, you might note
that functions like getchar() return character values in the range of
'unsigned char' or EOF, while 'char' can be negative. So code like

    char buf[] = "<UTF-8 string>";
    int ch, i;
    ...
    while ((ch = getchar()) != EOF) {
        if (ch == buf[i]) ...

is wrong. (Even if you don't use UTF-8, but you may not have noticed
before.) You need to convert ch to char or buf[i] to unsigned char
before comparing the two.

--
Hallvard
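
One way to write the comparison described above is sketched here, assuming the input comes from getchar(). The function names same_byte and stdin_starts_with are invented for this example and are not from the thread. The cast to unsigned char puts the stored byte in the same 0..UCHAR_MAX range that getchar() uses, so a byte such as 0xC3 compares equal whether plain char is signed or not.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: compare a value returned by getchar() with a byte
   stored in a plain char array. */
int same_byte(int ch, char stored)
{
    /* getchar() returns EOF or a value in the range of unsigned char. */
    assert(ch == EOF || (ch >= 0 && ch <= UCHAR_MAX));
    return ch == (unsigned char)stored;
}

/* Hypothetical helper: returns 1 if the next bytes read from stdin are
   exactly the bytes of the string buf, and 0 otherwise. */
int stdin_starts_with(const char *buf)
{
    size_t i;

    for (i = 0; buf[i] != '\0'; i++) {
        int ch = getchar();

        if (ch == EOF || !same_byte(ch, buf[i]))
            return 0;
    }
    return 1;
}
```

Without the cast, on a system where plain char is signed, the stored byte 0xC3 would be a negative value and could never equal the 195 that getchar() returns for it.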


In article <news:br***********@imsp212.netvigator.com>
Jacky Cheung <ja*****@yahoo.com> writes:
> I am developing a vCard application which have to support UTF-8. Does the
> UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
> treat as NULL character in strlen?





UTF-8 is simply an encoding mechanism for taking larger-than-8-bit
values and storing them in 8-bit values. The details of this
mechanism are pretty much off-topic in comp.lang.c, but here we
can say that UTF-8 encoded characters will always fit in objects
of type "unsigned char", as those will have at least 8 bits.

Your actual question above cannot (quite) be answered as asked as
it appears to contain at least one false assumption, i.e., that
the presence of a '\0' character in an array of unsigned char will
"crash" strlen(). In fact, strlen() simply operates on an array
of (plain, i.e., optionally-signed at the compiler's discretion)
char, searching forward until it finds a '\0' value, then returning
the number of non-'\0' "char"s it has skipped. Passing strlen()
the address of an array of "char" that does *not* contain a '\0'
could cause the program to crash (or indeed exhibit any behavior
at all); so I think what you really mean to ask is:

"Given some sequence of values in some wider-than-8-bit
character set (such as 16 or 32 bit Unicode), suppose I have
encoded it in 8-bit bytes using the UTF-8 scheme. Can I
(usefully) apply strlen() to the result?"

The answer to this version of the question is "maybe". In particular,
you must ensure that:

a) none of the 8-bit values is a trap representation in plain
"char" if plain "char" is signed (and the C language proper is
not terribly helpful here, but you could constrain yourself to
two's complement systems or those with wide-enough "plain" chars,
by checking that either CHAR_MAX >= 255 -- i.e., no UTF-8 value
will be negative -- or that CHAR_MIN <= -128);

b) the "char" array you have used to store the encoded
values is '\0'-terminated;

c) you did not embed any '\0' values in that array, and

d) the resulting strlen() value meets any other criteria
you may hide beneath the word "useful".

The conditions in part (a) are met by most C systems today, so you
might simply assume them (and document that assumption somewhere).
The conditions in parts (b) and (c) may, or may not, arise naturally
out of the values you are UTF-8 encoding -- this part is up to you.
Part (d) is likewise something only you can answer.
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
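
Conditions (a) and (d) above can be sketched in C as follows, assuming the input is valid UTF-8. The function names utf8_code_points and check_euro are invented for this example. The preprocessor test stops compilation on a system where plain char cannot hold every 8-bit value. The counting function shows one possible meaning of "useful": strlen() returns bytes, so counting characters requires skipping the continuation bytes, which have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Condition (a): refuse to compile where plain char is too narrow to
   represent every 8-bit value. */
#if CHAR_MAX < 255 && CHAR_MIN > -128
#error "plain char cannot represent every UTF-8 byte value"
#endif

/* Hypothetical helper: count the code points in a '\0'-terminated string
   that is assumed to be valid UTF-8, by counting every byte that is not
   a continuation byte (10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* The euro sign (U+20AC) is the three bytes 0xE2 0x82 0xAC in UTF-8. */
void check_euro(void)
{
    const char euro[] = "\xE2\x82\xAC";

    assert(strlen(euro) == 3);            /* three bytes */
    assert(utf8_code_points(euro) == 1);  /* one character */
}
```

Whether the byte count or the code point count is the useful one depends on the job: buffer sizes need bytes, while a length limit shown to a user probably does not.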

