Extract the first letter of a UTF-8 string with Lua
Question
Is there any way to extract the first letter of a UTF-8 encoded string with Lua?
Lua does not properly support Unicode, so string.sub("ÆØÅ", 2, 2) will return "?" rather than "Ø".
Is there a relatively simple UTF-8 parsing algorithm I could use on the string byte per byte, for the sole purpose of getting the first letter of the string, be it a Chinese character or an A?
Or is this way too complex, requiring a huge library, etc.?
Recommended answer
You can easily extract the first letter from a UTF-8 encoded string with the following code:
-- Returns the first UTF-8 encoded character of str as a string (nil if str is empty).
function firstLetter(str)
  return str:match("[%z\1-\127\194-\244][\128-\191]*")
end
This works because a UTF-8 encoded code point is either a single byte from 0 to 127, or a lead byte from 194 to 244 followed by one to three continuation bytes from 128 to 191.
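The byte ranges above can also be applied with plain byte arithmetic instead of a pattern. The following is only a sketch: the helper names sequenceLength and firstLetterByBytes are made up for illustration, and the code assumes the input is already valid UTF-8 (continuation bytes are not checked).

```lua
-- Hypothetical helper: number of bytes in the UTF-8 sequence that starts
-- with the given lead byte, or nil if the byte cannot start a sequence.
local function sequenceLength(leadByte)
  if leadByte <= 127 then return 1 end                       -- ASCII
  if leadByte >= 194 and leadByte <= 223 then return 2 end
  if leadByte >= 224 and leadByte <= 239 then return 3 end
  if leadByte >= 240 and leadByte <= 244 then return 4 end
  return nil  -- continuation byte (128-191) or invalid lead byte
end

-- Hypothetical helper: first character of str, assuming valid UTF-8.
local function firstLetterByBytes(str)
  local lead = str:byte(1)
  if not lead then return nil end      -- empty string
  local len = sequenceLength(lead)
  if not len then return nil end       -- str does not start on a character boundary
  return str:sub(1, len)
end

-- "\195\134\195\152\195\133" is the UTF-8 encoding of "ÆØÅ".
print(firstLetterByBytes("\195\134\195\152\195\133"))
```

The strings are written with decimal byte escapes so that the example does not depend on the encoding of the source file.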
You can even iterate over UTF-8 code points in a similar manner:
-- Each iteration yields one UTF-8 encoded character of str as a string.
for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
  print(code)
end
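As a possible use of this iteration, the characters can be collected into a table so that they can be indexed by character position rather than byte position, which is what string.sub("ÆØÅ", 2, 2) in the question was trying to do. This is a minimal sketch; the names UTF8_CHAR and utf8Chars are assumptions, not part of any library.

```lua
-- Same pattern as in the answer; the name UTF8_CHAR is chosen for illustration.
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

-- Hypothetical helper: split str into a table of UTF-8 encoded characters.
local function utf8Chars(str)
  local chars = {}
  for ch in str:gmatch(UTF8_CHAR) do
    chars[#chars + 1] = ch
  end
  return chars
end

-- "\195\134\195\152\195\133" is the UTF-8 encoding of "ÆØÅ".
local chars = utf8Chars("\195\134\195\152\195\133")
print(#chars)    -- number of characters, not bytes
print(chars[2])  -- second character as a string
```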
Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
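If the numerical code point is needed, the bytes of the matched character can be decoded by hand. The sketch below uses a made-up function name, firstCodePoint, and does not reject overlong or otherwise malformed sequences beyond checking the length. On Lua 5.3 and later the built-in utf8.codepoint should serve the same purpose; the manual version is meant for older Lua versions.

```lua
-- Hypothetical helper: numeric Unicode code point of the first character
-- of str, or nil if str is empty or does not start with a complete sequence.
local function firstCodePoint(str)
  local ch = str:match("^[%z\1-\127\194-\244][\128-\191]*")
  if not ch then return nil end
  local lead = ch:byte(1)
  if lead <= 127 then return lead end  -- ASCII: the byte is the code point

  -- Expected sequence length and marker-bit offset, derived from the lead byte.
  local expected, offset
  if lead < 224 then expected, offset = 2, 192
  elseif lead < 240 then expected, offset = 3, 224
  else expected, offset = 4, 240 end
  if #ch ~= expected then return nil end  -- truncated or overlong run of bytes

  -- Strip the marker bits of the lead byte, then append the low 6 bits
  -- of each continuation byte.
  local code = lead - offset
  for i = 2, expected do
    code = code * 64 + (ch:byte(i) - 128)
  end
  return code
end

-- "\230\177\137" is the UTF-8 encoding of U+6C49.
print(firstCodePoint("\230\177\137"))
```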