Using JavaScript, how can I count a mix of Asian characters and English words


Problem description


I need to take a string of mixed Asian characters (for now, assume only Chinese hanzi or Japanese kanji/hiragana/katakana) and "alphanumeric" text (e.g., English, French), and count it in the following way:

1) count each Asian CHARACTER as 1;
2) count each alphanumeric WORD as 1.

a few examples:

株式会社myCompany = 4 chars + 1 word = 5 total
株式会社マイコ = 7 chars


My only idea so far is to use:

var wordArray = val.split(/\w+/);

and then check each element to see if its contents are alphanumeric (so count as 1) or not (so take the array length). But I don't feel that's really very clever at all, and the text being counted might be up to 10,000 words, so it's not very quick.

Ideas?

Solution

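Unfortunately, JavaScript's RegExp has traditionally had no support for Unicode character classes (property escapes such as \p{Script=Han} exist only in ES2018+ engines, and only with the u flag); \w only applies to ASCII characters (modulo some browser bugs).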
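A quick way to see the ASCII-only behaviour of \w for yourself. This is only a minimal sketch; the sample characters are picked for illustration and are not from the question:

```javascript
// \w is equivalent to [A-Za-z0-9_], so accented and CJK characters never match it.
var samples = ['a', '7', '_', '\u00E9', '\u682A', '\u30DE']; // a, 7, _, é, 株, マ
var matchesWordClass = samples.map(function (ch) {
    return /^\w$/.test(ch);
});

console.log(matchesWordClass); // [ true, true, true, false, false, false ]
```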

You can use Unicode characters in character groups ([...]), though, so you can do it if you can express each set of characters you are interested in as a range. For example:

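var r = new RegExp(
    '[A-Za-z0-9_]+|' +                              // Runs of ASCII letters, digits, underscore (no accents)
    '[\u3040-\u309F]+|' +                           // Runs of hiragana
    '[\u30A0-\u30FF]+|' +                           // Runs of katakana
    '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]',    // Single CJK ideographs
    'g');

// match() returns null when nothing matches, so fall back to an empty array
var nwords = (str.match(r) || []).length;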

(This attempts to give a more realistic count of ‘words’ for Japanese, counting each run of one type of kana as a word. That's still not right, of course, but it's probably closer than treating each syllable as one word.)
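If you want the question's literal rule instead (every kana counts as 1, so 株式会社マイコ comes out as 7), one option is to drop the + from the kana ranges. A rough sketch follows; countMixed is a made-up helper name, not something from the answer above:

```javascript
// Hypothetical helper: counts each kana or CJK ideograph as 1,
// and each run of ASCII letters/digits/underscore as 1.
function countMixed(str) {
    var r = new RegExp(
        '[A-Za-z0-9_]+|' +                              // a whole ASCII word counts once
        '[\u3040-\u309F]|' +                            // each hiragana counts once
        '[\u30A0-\u30FF]|' +                            // each katakana counts once
        '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]',    // each CJK ideograph counts once
        'g');
    var matches = str.match(r);
    return matches ? matches.length : 0;                // match() returns null on no match
}

console.log(countMixed('株式会社myCompany')); // 5
console.log(countMixed('株式会社マイコ'));     // 7
```

This is still a single pass over the string with one regex, so a text of 10,000 words or so should not be a problem.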

Obviously there are many more characters that would have to be accounted for if you wanted to ‘do it properly’. Let's hope you don't have characters outside the Basic Multilingual Plane, for one!
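If you can assume an ES2018+ engine, Unicode property escapes with the u flag cover accented Latin letters and ideographs outside the BMP without hand-written ranges. The following is only a sketch under that assumption. countMixedUnicode is a hypothetical name, and it follows the question's rule of counting every kana as 1. Characters shared between scripts (for example the prolonged sound mark ー) may need Script_Extensions rather than Script, so treat the pattern as a starting point:

```javascript
// Assumes an ES2018+ engine: \p{...} property escapes require the u flag.
// countMixedUnicode is a hypothetical name used only for this sketch.
function countMixedUnicode(str) {
    // With the u flag the regex works on code points, so an ideograph outside
    // the BMP (stored as a surrogate pair) is matched as a single character.
    const r = /\p{Script=Han}|\p{Script=Hiragana}|\p{Script=Katakana}|[\p{Script=Latin}0-9_]+/gu;
    const matches = str.match(r);
    return matches ? matches.length : 0;
}

console.log(countMixedUnicode('株式会社myCompany'));      // 5
console.log(countMixedUnicode('\u{20BB7}\u91CE\u5BB6'));  // 3 (𠮷野家; the first character is outside the BMP)
console.log(countMixedUnicode('caf\u00E9 au lait'));      // 3
```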
