Length of Greek character string is larger than it should be

Question

I'm writing a program that takes a string of Greek characters as input, and when I print its length, I get double the expected value. For example, if ch="ΑΒ" (Greek characters) or ch="αβ", printf("%d",strlen(ch)); outputs 4 instead of 2. If ch="ab", it outputs 2. What's going on?

Recommended answer

Probably because your string is encoded using a variable-width character encoding.

In the good old days, we only bothered with 128 different characters: a-z, A-Z, 0-9, some punctuation such as commas and brackets, and control codes. Everything fit in 7 bits, and we called it ASCII. Then that wasn't enough, so we added other things like letters with lines or dots on top, moved to 8 bits (1 byte), and could represent any of 256 characters in one byte. (People's ideas of what should go in those extra 128 slots varied widely, based on what was most useful in their language - see the comment from usr2564301 - so you then had to say whose version of those extra slots you were using.)

If you had 2 characters in your string, it would always be 2 bytes long (plus a null terminator, perhaps).

But then people woke up to the fact that English isn't the only language in the world, and that there are in fact thousands of letters in hundreds of languages around the globe. Now what to do?

Well, we could say there are only about 65,000 characters that interest us, and encode every letter in two bytes. There are some encoding formats that do this (UCS-2, for example). A two-letter string will then always be 4 bytes (um, perhaps with a byte order mark at the front, and maybe a null terminator at the end). Two problems: a) it's not very backwards compatible with ASCII, and b) it wastes bytes if most of the text is in the good ol' ASCII character set anyway.

Enter UTF-8, which I'll wager is what your string is using for its encoding, or something similar. ASCII characters, like 'a' and 'b', are encoded with one byte, and more exotic characters (--blush-- from an English-speaking perspective) take up more than one byte, where the first byte says "what follows is to be taken along with this byte to represent a letter". So you get a variable-width encoding: the length of a two-letter string will be at least two bytes, but if it includes non-ASCII characters, it'll be more. In UTF-8 each Greek letter takes two bytes, and strlen counts bytes rather than letters, which is why your two-letter string reports 4.
