Extract the first letter of a UTF-8 string with Lua


Question

Is there any way to extract the first letter of a UTF-8 encoded string with Lua?

Lua's string functions operate on bytes rather than on Unicode characters, so string.sub("ÆØÅ", 2, 2) returns a lone byte (displayed as "?") rather than "Ø".

Is there a relatively simple UTF-8 parsing algorithm I could run over the string byte by byte, for the sole purpose of getting its first letter, be it a Chinese character or an A?

Or is this far too complex, requiring a huge library?

Recommended answer

You can easily extract the first letter from a UTF-8 encoded string with the following code:

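-- Returns the first UTF-8 character of str as a string (nil if none is found).
function firstLetter(str)
  -- One lead byte (0-127 or 194-244), then any continuation bytes (128-191).
  return str:match("[%z\1-\127\194-\244][\128-\191]*")
end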

This works because a UTF-8 encoded code point is either a single byte from 0 to 127, or a lead byte from 194 to 244 followed by one to three continuation bytes from 128 to 191.
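The byte ranges above can also be applied by hand, which is close to the byte-by-byte approach the question asks about. The following is a minimal sketch, not part of the original answer: sequenceLength and firstLetterByBytes are hypothetical names, and the code assumes the input is well-formed UTF-8 (it does not check the continuation bytes).

```lua
-- Hypothetical helper: number of bytes in the UTF-8 sequence that starts
-- with the given lead byte, or nil if the byte cannot start a sequence.
local function sequenceLength(leadByte)
  if leadByte <= 127 then return 1 end                      -- ASCII
  if leadByte >= 194 and leadByte <= 223 then return 2 end  -- 2-byte sequence
  if leadByte >= 224 and leadByte <= 239 then return 3 end  -- 3-byte sequence
  if leadByte >= 240 and leadByte <= 244 then return 4 end  -- 4-byte sequence
  return nil  -- continuation byte (128-191) or invalid lead byte
end

-- Hypothetical helper: first character of str, found by inspecting byte 1 only.
-- Assumption: str is valid UTF-8; continuation bytes are not validated.
local function firstLetterByBytes(str)
  local lead = str:byte(1)
  if not lead then return nil end  -- empty string
  local len = sequenceLength(lead)
  if not len then return nil end
  return str:sub(1, len)
end
```

Unlike the pattern, this version returns nil when the string starts with a stray continuation byte instead of skipping ahead to the next valid lead byte.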

You can even iterate over UTF-8 code points in a similar manner:

for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
  print(code)
end
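Building on the loop above, one possible way to reach a character by position (the string.sub("ÆØÅ", 2, 2) case from the question) is to collect the matches into a table first. This is only a sketch: letters and UTF8_CHAR are hypothetical names, and the string literal is assumed to be stored as UTF-8 in the source file.

```lua
-- Same pattern as in the answer, kept in one place for reuse.
local UTF8_CHAR = "[%z\1-\127\194-\244][\128-\191]*"

-- Hypothetical helper: split str into a table of one-character strings.
local function letters(str)
  local result = {}
  for ch in str:gmatch(UTF8_CHAR) do
    result[#result + 1] = ch
  end
  return result
end

-- Assumes this file is saved as UTF-8.
local parts = letters("ÆØÅ")
print(#parts)    -- number of characters rather than bytes
print(parts[2])  -- the second character as a string
```

Building the table costs one pass over the string, so for repeated positional lookups it is probably cheaper to split once and index the table than to rescan each time.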

Note that both examples yield each letter as a string value, not as the numeric value of its Unicode code point.
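If the numeric code point is needed, the matched string can be decoded by hand. The sketch below is an illustration rather than part of the original answer: firstCodePoint is a hypothetical name, and the code assumes valid UTF-8 input without detecting malformed sequences. It uses plain arithmetic so that it does not depend on the utf8 library, which is only available from Lua 5.3 onward.

```lua
-- Hypothetical helper: numeric code point of the first character in str.
-- Assumption: str is valid UTF-8; malformed input is not detected.
local function firstCodePoint(str)
  local ch = str:match("^[%z\1-\127\194-\244][\128-\191]*")
  if not ch then return nil end
  local lead = ch:byte(1)
  if lead < 128 then return lead end  -- ASCII maps to itself

  -- Strip the length-marker bits from the lead byte.
  local code
  if lead >= 240 then
    code = lead - 240   -- 4-byte sequence: 11110xxx
  elseif lead >= 224 then
    code = lead - 224   -- 3-byte sequence: 1110xxxx
  else
    code = lead - 192   -- 2-byte sequence: 110xxxxx
  end

  -- Each continuation byte (10xxxxxx) contributes its low 6 bits.
  for i = 2, #ch do
    code = code * 64 + (ch:byte(i) - 128)
  end
  return code
end
```

On Lua 5.3 or later, utf8.codepoint(str, 1) covers the same use case and raises an error on invalid byte sequences.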
