Using JavaScript to check whether a string contains Japanese characters (including kanji)


Question


How can I check whether a given string contains one or more Japanese characters (consisting of kana and/or kanji)?

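I saw a similar question here: How can I check if variable contains Chinese/Japanese characters? (https://stackoverflow.com/questions/11206443/how-can-i-check-if-variable-contains-chinese-japanese-characters), and I used the solution to come up with this: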

var containsJapanese = string.match(/[\u3400-\u9FBF]/);

However, this gives many false positives.

I've tested it by having a script iterate through the contents of entire web pages (such as Facebook and Stack Overflow) and mark the divs that supposedly contain Japanese text. In those cases, a large number of divs end up getting marked by mistake. I've also tested it on pages that do contain Japanese text: the Japanese divs there are marked correctly, but so are many divs that contain no Japanese.

Solution

Check whether this works or not. I found this website that seems to list all the characters in Unicode that might be used in Japanese text.

The corresponding regex (for single character) would be:

/[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/
  -------------_____________-------------_____________-------------_____________
   Punctuation   Hiragana     Katakana    Full-width       CJK      CJK Ext. A
                                            Roman/      (Common &      (Rare)    
                                          Half-width    Uncommon)
                                           Katakana
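As a minimal sketch of how the character class above could be used, the snippet below wraps it in a helper. The names `JAPANESE_CHAR` and `containsJapanese` are illustrative and not part of the original answer. Note that kanji share code points with Chinese hanzi, so this check alone cannot tell Japanese text from Chinese text.

```javascript
// Character class copied from the answer; the constant and function names are
// hypothetical, chosen only for this example.
const JAPANESE_CHAR =
  /[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/;

// Returns true if at least one character of `text` falls in any listed range.
function containsJapanese(text) {
  return JAPANESE_CHAR.test(text);
}

console.log(containsJapanese("Hello, world")); // false
console.log(containsJapanese("こんにちは"));     // true (hiragana)
console.log(containsJapanese("漢字"));          // true (kanji)
console.log(containsJapanese("中文"));          // true (hanzi share the CJK ranges)
```

Using `test` rather than `match` returns a plain boolean instead of a match array or `null`.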

The ranges are (as quoted from the site):

  • 3000 - 303f: Japanese-style punctuation
  • 3040 - 309f: Hiragana
  • 30a0 - 30ff: Katakana
  • ff00 - ff9f: Full-width Roman characters and half-width Katakana
  • 4e00 - 9faf: CJK unified ideographs - Common and uncommon Kanji
  • 3400 - 4dbf: CJK unified ideographs Extension A - Rare Kanji

I have changed the ranges a bit:

  • I changed the range for full-width Roman characters and half-width katakana from ff00 - ffef to ff00 - ff9f. The code points ffa0 - ffdc contain half-width Hangul characters, which is not what you want. You may want to re-add the code points ffe0 - ffef, but they are mostly full-width currency symbols and half-width symbols.
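The effect of narrowing that range can be checked with a few sample characters. This is only an illustrative sketch; the variable names are made up for the example.

```javascript
// Range used in the answer vs. the wider range listed on the site.
const narrowed = /[\uff00-\uff9f]/;
const original = /[\uff00-\uffef]/;

const halfwidthKatakana = "\uff76"; // HALFWIDTH KATAKANA LETTER KA
const halfwidthHangul = "\uffa1";   // HALFWIDTH HANGUL LETTER KIYEOK
const fullwidthYen = "\uffe5";      // FULLWIDTH YEN SIGN

console.log(narrowed.test(halfwidthKatakana)); // true
console.log(narrowed.test(halfwidthHangul));   // false
console.log(original.test(halfwidthHangul));   // true
console.log(narrowed.test(fullwidthYen));      // false (ffe0 - ffef was dropped too)
```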

You can check the site and remove any range you don't want or that you are sure will not appear in your input.
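One way to make ranges easy to add or drop is to build the character class from a table. The sketch below assumes the six ranges listed above; `RANGES` and `buildJapaneseRegex` are hypothetical names, not something from the original answer.

```javascript
// Ranges from the answer, keyed by an illustrative label.
const RANGES = {
  punctuation: ["3000", "303f"],
  hiragana: ["3040", "309f"],
  katakana: ["30a0", "30ff"],
  fullwidthRomanHalfwidthKatakana: ["ff00", "ff9f"],
  kanji: ["4e00", "9faf"],
  kanjiExtensionA: ["3400", "4dbf"],
};

// Builds a single-character regex from the selected range labels.
function buildJapaneseRegex(names) {
  const body = names
    .map((name) => {
      if (!(name in RANGES)) {
        throw new Error(`Unknown range: ${name}`);
      }
      const [start, end] = RANGES[name];
      return `\\u${start}-\\u${end}`;
    })
    .join("");
  return new RegExp(`[${body}]`);
}

// Kana only: avoids matching text that contains just CJK ideographs.
const kanaOnly = buildJapaneseRegex(["hiragana", "katakana"]);
console.log(kanaOnly.test("カタカナ")); // true
console.log(kanaOnly.test("漢字"));     // false
```

Restricting the check to kana, as in `kanaOnly`, is one option if ideographs shared with Chinese are a source of false positives, at the cost of missing Japanese strings written entirely in kanji.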
