Convert utf-16 to utf-8 in C


Question

I have a university assignment I need some help with. Don't give me the solution; hints or small portions of code would be appreciated.

So, my university project is all about Unicode. To be exact, I have to write code that takes character input in UTF-16 format, converts it to UTF-8 and writes it to the appropriate output (terminal console or file.txt), whilst also doing the following:
1) Don't use arrays
2) Use putchar
3) Use getchar

Note: I am in my second year there, but it would be best if I did not use pointers and scanf.

I'd rather not post code unless necessary, in case my professor is watching the forums.
Here's my start:

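#include <stdio.h>

int main(void) {
	int byte1, byte2;
	long char1, char2; /* long is at least 32 bits, enough for U+10FFFF */

	/* Input is assumed to be big-endian UTF-16: high byte first. */
	while ((byte1 = getchar()) != EOF && (byte2 = getchar()) != EOF) {
		char1 = ((long)byte1 << 8) + byte2;

		if (char1 >= 0xD800 && char1 <= 0xDBFF) {
			/* High surrogate: read the next 16-bit unit. */
			byte1 = getchar();
			byte2 = getchar();
			if (byte1 == EOF || byte2 == EOF)
				break; /* input ended in the middle of a pair */
			char2 = ((long)byte1 << 8) + byte2;

			if (char2 >= 0xDC00 && char2 <= 0xDFFF) {
				char1 -= 0xD800;
				char2 -= 0xDC00;
				char1 <<= 10;
				char1 += char2;
				char1 += 0x010000;
				//write code that converts to utf 8
			}
		} else if (char1 <= 0xD7FF || char1 >= 0xE000) {
			//write code that converts to utf 8
		}
	}
	return 0;
}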



Is my code up to this point correct? Is the shifting right? If not, explain to me how I could make it work.

Recommended answer

First, you did not show how your objects named char… are declared. You need to do all the calculations on a 32-bit unsigned integer; anything narrower is not enough to represent a code point beyond the BMP (Basic Multilingual Plane).

I did not check the UTF-16 part, but at least one piece is missing: there should be two different branches, one for UTF-16LE and another for UTF-16BE. In each case, you first check whether you are reading a surrogate pair and then calculate your internal representation of the code point from the pair, in the form of an unsigned 32-bit integer. Byte order swaps the two bytes inside every 16-bit word, surrogates included; the order of the two words in a pair does not change (the high surrogate always comes first). Any other code point is a single 16-bit word whose unsigned integer interpretation is arithmetically equal to the code point value. Please see:
https://en.wikipedia.org/wiki/Endianness,
https://en.wikipedia.org/wiki/UTF-16.
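
To make the two branches concrete, here is a minimal sketch of the arithmetic only, not a full solution. The function names and the `big_endian` flag are made up for illustration; the two byte arguments are assumed to be values already returned by getchar and checked against EOF.

```c
#include <assert.h>
#include <stdint.h>

/* Build one 16-bit code unit from two bytes given in stream order.
   big_endian is a hypothetical flag chosen by the caller. */
uint32_t make_unit(int first, int second, int big_endian)
{
    assert(first >= 0 && first <= 0xFF && second >= 0 && second <= 0xFF);
    return big_endian ? ((uint32_t)first << 8) | (uint32_t)second
                      : ((uint32_t)second << 8) | (uint32_t)first;
}

int is_high_surrogate(uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low_surrogate(uint32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid surrogate pair into a code point in U+10000..U+10FFFF. */
uint32_t combine_surrogates(uint32_t high, uint32_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

For example, the bytes D8 3D DE 00 in big-endian order give the units 0xD83D and 0xDE00, which combine to U+1F600. In little-endian order the same units arrive as 3D D8 00 DE.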

The goal of the first stage is to interpret the UTF-16 encoding character by character, representing each character as an unsigned integer value (32 bits wide, as noted above) that is arithmetically equal to the code point. Here, you need to realize that Unicode code points are a mathematical abstraction representing cardinal values; they are independent of any bitwise or other computer representation. They are just abstract mathematical values.

Now, UTF-8 is also a variable-width encoding. It uses a pretty cunning algorithm with very low redundancy. It is fully described, for example, here: https://en.wikipedia.org/wiki/UTF-8.

Just follow the algorithm description. I don't think it's anything too complicated.
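
For readers who want to check their own attempt, here is one possible sketch of that algorithm under the constraints in the question (putchar only, no arrays, no pointers). The function names are hypothetical, and the sketch assumes the caller passes a valid code point, with surrogate values already rejected.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Number of UTF-8 bytes needed for a code point (1 to 4). */
int utf8_length(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Byte number index (0 = lead byte) of the UTF-8 form of cp.
   Computing one byte at a time avoids any array. */
int utf8_byte(uint32_t cp, int index)
{
    int len = utf8_length(cp);
    assert(index >= 0 && index < len);
    int shift = 6 * (len - 1 - index);
    if (index > 0)
        return 0x80 | (int)((cp >> shift) & 0x3F);  /* 10xxxxxx */
    if (len == 1) return (int)cp;                    /* 0xxxxxxx */
    if (len == 2) return 0xC0 | (int)(cp >> shift); /* 110xxxxx */
    if (len == 3) return 0xE0 | (int)(cp >> shift); /* 1110xxxx */
    return 0xF0 | (int)(cp >> shift);               /* 11110xxx */
}

/* Write the UTF-8 bytes of cp to standard output. */
void put_utf8(uint32_t cp)
{
    int len = utf8_length(cp);
    for (int i = 0; i < len; i++)
        putchar(utf8_byte(cp, i));
}
```

As a check, U+20AC (the euro sign) needs three bytes, E2 82 AC, and U+1F600 needs four, F0 9F 98 80.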

There is another optional feature of UTF-16 and UTF-8 streams: the BOM (byte order mark). This marker is optional, so you need to decide what to do with text that lacks it. You can refuse to process the input if the marker is not found, or you need another function where the expected encoding is specified. That should be your design. Please see: http://unicode.org/faq/utf_bom.html.
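
A sketch of the detection step, assuming the first two bytes of the stream have been read with getchar; the enum and function name are invented for this example. In UTF-16 the BOM is the code point U+FEFF, so it appears as the bytes FE FF in big-endian streams and FF FE in little-endian ones.

```c
#include <assert.h>

enum byte_order { ORDER_UNKNOWN, ORDER_BIG, ORDER_LITTLE };

/* Classify the first two bytes of a UTF-16 stream.
   Either argument may be EOF (a negative value) for very short input. */
enum byte_order detect_bom(int first, int second)
{
    assert(first <= 0xFF && second <= 0xFF);
    if (first == 0xFE && second == 0xFF) return ORDER_BIG;
    if (first == 0xFF && second == 0xFE) return ORDER_LITTLE;
    return ORDER_UNKNOWN; /* no BOM: these two bytes are ordinary data */
}
```

What to do on ORDER_UNKNOWN is the design decision described above: reject the input, or fall back to a byte order the caller specifies. In the second case the two bytes already read are the first code unit and must not be discarded.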

And finally, one delicate point: both encodings allow invalid data. In your particular problem, UTF-8 is never a source, so all the problems you may have are with UTF-16. If, for example, you meet the second member of a surrogate pair before the first one, this is invalid data. If you have only one member of a surrogate pair surrounded by non-surrogate words, this is invalid data. So, you have to decide what to do with such cases, and the choice is entirely yours. It should be by your design.
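
One possible policy, shown only as a sketch: replace every unpaired surrogate with U+FFFD, the Unicode replacement character. This is a common convention but still a design choice; stopping with an error is equally defensible. The function names are hypothetical, and the caller is assumed to pass 0 as `next` when no further code unit exists.

```c
#include <assert.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Decode the code unit `unit`, looking at the following unit `next`.
   Returns a code point, or U+FFFD for an unpaired surrogate. */
uint32_t decode_unit(uint32_t unit, uint32_t next)
{
    assert(unit <= 0xFFFF && next <= 0xFFFF);
    if (unit >= 0xD800 && unit <= 0xDBFF) {      /* high surrogate */
        if (next >= 0xDC00 && next <= 0xDFFF)
            return 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        return REPLACEMENT_CHAR;                 /* high without low */
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF)
        return REPLACEMENT_CHAR;                 /* low without high */
    return unit;                                 /* ordinary BMP unit */
}

/* True when decode_unit used `next` as the second half of a pair. */
int consumed_next(uint32_t cp)
{
    return cp >= 0x10000;
}
```

When consumed_next returns 0, the unit passed as `next` has not been used and must be decoded on the following step, so a character that follows a lone high surrogate is not lost.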

I hope I did all you wanted: no code, but now you have all the ins and outs. Is it clear?

—SA

