How to use regex for UTF-8 in Ruby


Question

In RoR, how can I validate that a field in a posting form contains a Chinese or Japanese word when the text is encoded as UTF-8?

In GBK code, [\u4e00-\u9fa5]+ is used to validate Chinese words. In PHP, /^[\x{4e00}-\x{9fa5}]+$/u is used for UTF-8 pages.

Recommended answer

Ruby 1.8 has poor support for UTF-8 strings. You need to write the bytes individually in the regular expression, rather than the full code point:

>> "acentuação".scan(/\xC3\xA7/)
=> ["ç"]    
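To find which bytes to write for a given character, one option is to unpack its UTF-8 encoding into byte values. This is a minimal sketch; utf8_hex_bytes is a hypothetical helper name, not part of Ruby:

```ruby
# Hypothetical helper: returns the UTF-8 bytes of a code point as hex strings.
def utf8_hex_bytes(codepoint)
  # pack("U") encodes the code point as UTF-8; unpack("C*") lists its bytes.
  [codepoint].pack("U").unpack("C*").map { |b| format("%02X", b) }
end

utf8_hex_bytes(0x00E7)  # code point of "ç" => ["C3", "A7"]
```

The result matches the \xC3\xA7 sequence used in the scan call above.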

To match the range you specified, the expression becomes a bit more complicated:

/([\x4E-\x9E][\x00-\xFF])|(\x9F[\x00-\xA5])/  # (untested)

As noted in the comments, the Unicode characters \u4E00-\u9FA5 map to the expression above only in the UTF-16BE encoding. Their UTF-8 encoding is different, so you need to analyze the mapping carefully and see whether you can come up with a byte-matching expression for Ruby 1.8.
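The mapping analysis can be sketched as follows. In UTF-8, U+4E00 encodes as E4 B8 80 and U+9FA5 as E9 BE A5, so the range splits into four byte-level alternatives. The names HAN_BYTES and han_only? are hypothetical. The sketch calls String#b so that it runs on Ruby 2.0 or later; on Ruby 1.8 strings are already plain bytes, so that call would be dropped, and the pattern has not been tested there.

```ruby
# Byte-oriented pattern for the UTF-8 encodings of U+4E00..U+9FA5,
# i.e. the byte sequences E4 B8 80 through E9 BE A5.
# The /n flag makes the regex operate on raw bytes; /x allows the layout.
HAN_BYTES = /\A(?:
    \xE4[\xB8-\xBF][\x80-\xBF]         # U+4E00..U+4FFF
  | [\xE5-\xE8][\x80-\xBF][\x80-\xBF]  # U+5000..U+8FFF
  | \xE9[\x80-\xBD][\x80-\xBF]         # U+9000..U+9F7F
  | \xE9\xBE[\x80-\xA5]                # U+9F80..U+9FA5
)+\z/nx

# Hypothetical helper: true if str consists only of characters in that range.
def han_only?(str)
  # String#b returns a binary copy so the byte regex can be applied.
  !(HAN_BYTES =~ str.b).nil?
end

han_only?([0x4E2D, 0x6587].pack("U*"))  # "中文" => true
han_only?("abc")                        # => false
```

On Ruby 1.9 and later this byte-level work is unnecessary, because regex literals accept \u escapes and /\A[\u4e00-\u9fa5]+\z/ expresses the same range directly.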
