How to read a UTF-8 string given its length in characters in plain C89?
Question
I'm writing a custom, cross-platform, minimalistic TCP server in plain C89. (I will also accept a POSIX-specific answer.)
The server works with UTF-8 strings but never looks inside them. It treats all strings as immutable binary blobs.
But now I need to accept UTF-8 strings from a client that does not know how to calculate their size in bytes. The client can only transmit the string length in characters. (Update: the client is written in JavaScript, and "length in characters" is, in fact, whatever String.length returns. I assume it counts actual UTF-8 characters, not something else.)
I do not want to add heavy dependencies to my tiny server. Is there a robust and neat way to read this datagram? (For the sake of this question, let's say that it is read from a FILE *.)
U<CRLF> ; data type marker (actually read by dispatching code)
<SIZE><CRLF> ; UTF-8 string size in characters
<DATA><CRLF> ; data blob
Example:
U
7
Юникод!
Update:
One batch of data can contain more than one datagram, so approximate reads would not work; I need to read the exact number of characters.
And the actual UTF-8 data may contain any characters, so I can't pick a character as a terminator; I don't want to mess with escaping it in the data.
Recommended answer
This looks like exactly the thing I need. I wish I had found it earlier:
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
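For illustration, here is a minimal C89 sketch of reading exactly N characters from a FILE * by looking only at lead bytes. It does not use the decoder linked above. The names utf8_seq_len and read_utf8_chars are hypothetical, and the sketch makes two assumptions: the transmitted count is a number of Unicode code points, and structural validation is enough. It rejects bad lead bytes and missing continuation bytes, but it does not catch every ill-formed sequence (for example overlong 3-byte forms or encoded surrogates), so a full validator is the safer choice for untrusted input. Note also that JavaScript's length counts UTF-16 code units, so a character outside the Basic Multilingual Plane counts as two there; if the client can send such characters, the loop would have to subtract two for each 4-byte sequence.

```c
#include <stdio.h>
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence introduced by lead byte c,
   or 0 if c cannot start a sequence (continuation byte, 0xC0, 0xC1,
   or anything above 0xF4). */
static int utf8_seq_len(int c)
{
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) return 2;
    if (c >= 0xE0 && c <= 0xEF) return 3;
    if (c >= 0xF0 && c <= 0xF4) return 4;
    return 0;
}

/* Reads exactly nchars code points from fp into buf (no NUL terminator
   is added). Returns the number of bytes stored, or -1 on EOF, on a
   structurally malformed sequence, or if buf is too small. */
long read_utf8_chars(FILE *fp, unsigned long nchars, char *buf, size_t cap)
{
    size_t used = 0;

    while (nchars > 0) {
        int c = fgetc(fp);
        int len, i;

        if (c == EOF) return -1;
        len = utf8_seq_len(c);
        if (len == 0 || used + (size_t)len > cap) return -1;
        buf[used++] = (char)c;

        /* Every remaining byte of the sequence must look like 10xxxxxx. */
        for (i = 1; i < len; i++) {
            c = fgetc(fp);
            if (c == EOF || (c & 0xC0) != 0x80) return -1;
            buf[used++] = (char)c;
        }
        nchars--;
    }
    return (long)used;
}
```

With the example datagram from the question, the dispatching code would parse the count 7 from the size line, call read_utf8_chars(fp, 7, buf, sizeof buf), and then expect the trailing CRLF as the next two bytes of the stream.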