如何在Ruby中的字符串中检测CJK字符? [英] How can I detect CJK characters in a string in Ruby?

查看:198
本文介绍了如何在Ruby中的字符串中检测CJK字符?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定Ruby 1.8.7中的字符串(没有使用\p {}支持Unicode属性的真棒Oniguruma正则表达式引擎),我想要能够确定字符串是否包含一个或多个中文,日文,或韩语字符;即

Given a string in Ruby 1.8.7 (without the awesome Oniguruma regular expression engine that supports Unicode properties with \p{}), I would like to be able to determine if the string contains one or more Chinese, Japanese, or Korean characters; i.e.

class String
  def contains_cjk?
    ...
  end
end

>> '日本語'.contains_cjk?
=> true
>> '광고 프로그램'.contains_cjk?
=> true
>> '艾弗森将退出篮坛'.contains_cjk?
=> true
>> 'Watashi ha bakana gaijin desu.'.contains_cjk?
=> false

我怀疑这将会归结为查看字符串中的任何字符是否在 Unihan CJKV Unicode块,但我想如果有人知道Ruby中现有的解决方案是值得的。 p>

I suspect that this will boil down to seeing if any of the characters in the string are in the Unihan CJKV Unicode blocks, but I figured it was worth asking if anyone knows of an existing solution in Ruby.

推荐答案

(ruby 1.9.2)

(ruby 1.9.2)

#encoding: UTF-8
class String
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

strings= ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each{|s| puts s.contains_cjk?}

#true
#true
#true
#false

\p {}匹配字符的Unicode脚本。

支持以下脚本:阿拉伯语,亚美尼亚语,巴厘语,孟加拉语,Bopomofo,Braille, ,Buhid,Canadian_Aboriginal,Carian,Cham,Cherokee,Common,Coptic,Cuneiform,Cypriot,Cyrillic,Deseret,Devanagari,Ethiopic,Georgian,Glagolitic,Gothic,Greek,Gujarati,Gurmukhi,Han,Hangul,Hanunoo,Hebrew,Hiragana,Inherited ,Kannada,Katakana,Kayah_Li,Kharoshthi,Khmer,Lao,Latin,Lepcha,Limbu,Linear_B,Lycian,Lydian,Malayalam,Mongolian,New_Tai_Lue,Nko,Ogham,Ol_Chiki,Old_Italic,Old_Persian,Oriya,Osmanya,Phags_Pa,Phoenician ,Rejang,Runic,Saurashtra,Shavian,Sinhala,Sundanese,Syloti_Nagri,Syriac,Tagalog,Tagbanwa,Tai_Le,Tamil,Telugu,Thaana,Thai,Tibetan,Tifinagh,Ugaritic,Vai和Yi。

\p{} matches a character’s Unicode script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmuk Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharosht Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.

哇。 Ruby Regexp来源

这篇关于如何在Ruby中的字符串中检测CJK字符?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆