How can I detect certain Unicode characters in a string in Ruby?
Problem description
Given a string in Ruby 1.8.7 (without the awesome Oniguruma regular expression engine that supports Unicode properties with \p{}), I would like to be able to determine if the string contains one or more Chinese, Japanese, or Korean characters; i.e.
class String
  def contains_cjk?
    ...
  end
end
>> '日本語'.contains_cjk?
=> true
>> '광고 프로그램'.contains_cjk?
=> true
>> '艾弗森将退出篮坛'.contains_cjk?
=> true
>> 'Watashi ha bakana gaijin desu.'.contains_cjk?
=> false
I suspect that this will boil down to seeing if any of the characters in the string are in the Unihan CJKV Unicode blocks, but I figured it was worth asking if anyone knows of an existing solution in Ruby.
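The block-range check suspected above can be sketched roughly as follows. This is only an illustration: the constant `CJK_BLOCK_RANGES` and the method `contains_cjk_by_codepoint?` are made-up names, and the ranges cover a handful of common blocks rather than every CJK-related block in Unicode. It avoids `\p{}` entirely by decoding the string with `unpack('U*')`, so it should not depend on Oniguruma, though it has not been verified on 1.8.7 here.

```ruby
# Assumption: an illustrative, non-exhaustive set of Unicode blocks.
CJK_BLOCK_RANGES = [
  0x1100..0x11FF, # Hangul Jamo
  0x3040..0x309F, # Hiragana
  0x30A0..0x30FF, # Katakana
  0x3400..0x4DBF, # CJK Unified Ideographs Extension A
  0x4E00..0x9FFF, # CJK Unified Ideographs
  0xAC00..0xD7AF  # Hangul Syllables
]

# Hypothetical helper: true if any code point falls inside one of the ranges.
def contains_cjk_by_codepoint?(str)
  # unpack('U*') decodes the UTF-8 bytes into an array of integer code points.
  str.unpack('U*').any? do |code_point|
    CJK_BLOCK_RANGES.any? { |range| range.include?(code_point) }
  end
end

puts contains_cjk_by_codepoint?('日本語')
puts contains_cjk_by_codepoint?('Watashi ha bakana gaijin desu.')
```

Listing ranges by hand is easy to get incomplete, which is the main reason to prefer script properties where the regexp engine supports them.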
(Ruby 1.9.2)
# encoding: UTF-8
class String
  # True if the string has at least one Han, Katakana, Hiragana, or Hangul character.
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

strings = ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each { |s| puts s.contains_cjk? }
#true
#true
#true
#false
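If reopening `String` is undesirable, the same test can live in a plain method. The sketch below is one possible variant, not part of the original answer: `CJK_SCRIPT_PATTERN` and `cjk?` are illustrative names, the four scripts are folded into a single character class, and `Regexp#match?` assumes Ruby 2.4 or later.

```ruby
# Assumption: illustrative names; same four scripts as the answer above.
CJK_SCRIPT_PATTERN = /[\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/

# Hypothetical helper that leaves String untouched.
def cjk?(text)
  # Regexp#match? returns true or false without building a MatchData object.
  CJK_SCRIPT_PATTERN.match?(text)
end

puts cjk?('日本語')                         # true
puts cjk?('Watashi ha bakana gaijin desu.') # false
```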
\p{} matches a character’s Unicode script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.
Wow. Ruby Regexp source.
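To see script matching work character by character, here is a small sketch that labels each character with the first script pattern it matches. `SCRIPT_PATTERNS` and `script_of` are hypothetical names, and only five of the scripts listed above are checked; anything else is reported as "other".

```ruby
# Assumption: a small, illustrative subset of the supported scripts.
SCRIPT_PATTERNS = {
  'Han'      => /\p{Han}/,
  'Hiragana' => /\p{Hiragana}/,
  'Katakana' => /\p{Katakana}/,
  'Hangul'   => /\p{Hangul}/,
  'Latin'    => /\p{Latin}/
}.freeze

# Hypothetical helper: name of the first matching script, or nil if none match.
def script_of(char)
  name, _pattern = SCRIPT_PATTERNS.find { |_name, pattern| pattern.match?(char) }
  name
end

'日ほン한a1'.each_char do |char|
  puts "#{char}: #{script_of(char) || 'other'}"
end
```

Digits and most punctuation belong to the Common script, so they fall through to "other" here.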