How to get the length of Japanese characters in Javascript?

Question

I have an ASP Classic page that uses the SHIFT_JIS charset. The meta tag in the page's head section looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">

My page has a text box (txtName) that should only allow 200 characters. I have a Javascript function that validates the character length, which is called on the onclick() event of my Submit button.

if(document.frmPage.txtName.value.length > 200) {
  alert("You have exceeded the maximum length of 200.");
  return false;
}

The problem is that Javascript does not report the correct length for Japanese characters encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding that Javascript uses by default. Some characters like ケ take 2 or 3 characters when in SHIFT_JIS.

If I rely only on the length reported by Javascript, long Japanese strings pass the page validation and the page then tries to save them to the database, which fails because the DB column has a maximum length of 200.

The browser that I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of the Japanese character using Javascript? Is it possible to convert from Unicode to SHIFT_JIS using Javascript? How?

Thanks for the help!

Solution

"For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding"

Let's be clear: 测, U+6D4B (Han Character 'measure, estimate, conjecture') is a single character. When you encode it to a particular encoding like Shift-JIS, it may very well become multiple bytes.

In general JavaScript doesn't make encoding tables available so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, this function would work out how many bytes are needed by keeping a list of all the characters that are a single byte, and assuming every other character takes two bytes:

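function getShiftJISByteLength(s) {
    // ASCII and half-width katakana (U+FF61-U+FF9F) are single bytes in Shift-JIS;
    // every other character is replaced by two placeholder characters before counting.
    return s.replace(/[^\x00-\x80\uFF61-\uFF9F]/g, 'xx').length;
}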
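As a hedged illustration of how such a byte count could stand in for the character count in the form check, the sketch below makes the same assumption as above (the input contains only characters that Shift-JIS can represent), treats ASCII and the half-width katakana block (U+FF61 to U+FF9F) as single bytes, and counts everything else as two. The names `estimateShiftJISBytes` and `fitsInColumn` are illustrative only and not part of any library.

```javascript
// Sketch only: estimates the Shift-JIS byte length without a real encoding table.
// Assumption: every character in `s` is representable in Shift-JIS.
function estimateShiftJISBytes(s) {
    var bytes = 0;
    for (var i = 0; i < s.length; i++) {
        var code = s.charCodeAt(i);
        // ASCII and half-width katakana (U+FF61..U+FF9F) take one byte each.
        var singleByte = code <= 0x7F || (code >= 0xFF61 && code <= 0xFF9F);
        bytes += singleByte ? 1 : 2;
    }
    return bytes;
}

// Hypothetical helper: true when the value fits a column limited to maxBytes bytes.
function fitsInColumn(value, maxBytes) {
    return estimateShiftJISBytes(value) <= maxBytes;
}
```

In the submit handler, `fitsInColumn(document.frmPage.txtName.value, 200)` would take the place of the `.length > 200` comparison.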

However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)

The reason you might think it is an 8-byte sequence is this: when a browser can't submit a character in a form because it does not exist in the target charset, it replaces it with an HTML character reference, in this case &#27979;, which is 8 characters long. This is a lossy mangling: you can't tell whether the user literally typed 测 or &#27979;. And if you are displaying the submitted content &#27979; as 测, then you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
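To make the 8-character figure concrete, and to show roughly what HTML-encoding on output looks like, here is a small sketch. The helper names `toNumericReference` and `escapeHtml` are made up for this example; a real application would normally rely on the escaping function of its server-side framework instead.

```javascript
// Builds the decimal character reference that gets substituted for an
// unencodable character. Assumes `ch` is a single BMP character.
function toNumericReference(ch) {
    return '&#' + ch.charCodeAt(0) + ';';
}

// Minimal HTML escaping for text placed in element content or attribute values.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}

var submitted = toNumericReference('\u6D4B'); // U+6D4B is 测, giving "&#27979;"
console.log(submitted.length);                // 8
console.log(escapeHtml(submitted));           // "&amp;#27979;"
```

With the escaping applied, the page shows the literal text &#27979; rather than silently turning it back into 测.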

The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:

function getUTF8ByteLength(s) {
    return unescape(encodeURIComponent(s)).length;
}

although probably it would be better to store native Unicode strings in the database so that the length limit refers to actual characters and not bytes in some encoding.
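For environments newer than the Internet Explorer setup described in the question, the same byte count can be obtained without the legacy `unescape` function. The sketch below assumes the standard `TextEncoder` API is available (modern browsers and Node.js, not old Internet Explorer); the function names are illustrative.

```javascript
// Number of bytes the string occupies when encoded as UTF-8.
// Assumes a global TextEncoder (modern browsers, Node.js).
function utf8ByteLength(s) {
    return new TextEncoder().encode(s).length;
}

// Number of Unicode code points, which differs from .length
// for characters outside the Basic Multilingual Plane.
function codePointLength(s) {
    return Array.from(s).length;
}

console.log('\u6D4B'.length);          // 1 (one UTF-16 code unit for 测)
console.log(utf8ByteLength('\u6D4B')); // 3
console.log(utf8ByteLength('abc'));    // 3
```

Which of the two counts to validate against depends on whether the database column limit is expressed in bytes or in characters.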
