What are all the Japanese whitespace characters?

Question

I need to split a string and extract words separated by whitespace characters. The source may be in English or Japanese. English whitespace characters include tab and space, and Japanese text uses these too. (IIRC, all widely-used Japanese character sets are supersets of US-ASCII.)

So the set of characters I need to use to split my string includes normal ASCII space and tab.

But, in Japanese, there is another space character, commonly called a 'full-width space'. According to my Mac's Character Viewer utility, this is U+3000 "IDEOGRAPHIC SPACE". This is (usually) what results when a user presses the space bar while typing in Japanese input mode.

Are there any other characters that I need to consider?

I am processing textual data submitted by users who have been told to "separate entries with spaces". However, the users are using a wide variety of computer and mobile phone operating systems to submit these texts. We've already seen that users may not be aware of whether they are in Japanese or English input mode when entering this data.

Furthermore, the behavior of the space key differs across platforms and applications even in Japanese mode (e.g., Windows 7 will insert an ideographic space but iOS will insert an ASCII space).

So what I want is basically "the set of all characters that look like a space and might be generated when a user presses the space key (or the tab key, since many users do not know the difference between a space and a tab) in Japanese and/or English input mode".

Is there any authoritative answer to such a question?

Answer

You need the ASCII tab, space and non-breaking space (U+00A0), and the full-width space, which you've correctly identified as U+3000. You might possibly want newlines and vertical space characters. If your input is in Unicode (not Shift-JIS, etc.) then that's all you'll need. There are other (control) characters such as \0 NULL which are sometimes used as information delimiters, but they won't be rendered as a space in East Asian text; that is, they won't appear as white-space.
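As a rough illustration of the list above, here is a minimal Python 3 sketch that splits on exactly those characters. The function name `split_entries` and the decision to include newline, carriage return, vertical tab and form feed are assumptions made for the example, not part of the original answer; it also assumes the input has already been decoded to a Unicode `str`.

```python
import re

# Separators named in the answer: ASCII tab, ASCII space,
# NO-BREAK SPACE (U+00A0) and IDEOGRAPHIC SPACE (U+3000).
# Newline, carriage return, vertical tab and form feed are an
# optional extra (assumption: entries may also span lines).
_SEPARATORS = re.compile("[\t \u00a0\u3000\n\r\v\f]+")


def split_entries(text):
    """Split text on runs of the separators above, dropping empty pieces."""
    return [piece for piece in _SEPARATORS.split(text) if piece]
```

For example, `split_entries("東京\u3000大阪 kyoto\tnara")` yields the four entries regardless of which kind of space the user typed between them.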

Edit: Matt Ball has a good point in his comment, but, as his example illustrates, many regex implementations don't deal well with full-width East Asian punctuation. In this connection, it's worth mentioning that Python's string.whitespace won't cut the mustard either, since it contains only ASCII whitespace characters.
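The point about `string.whitespace` can be checked directly. The sketch below assumes Python 3; the helper name `whitespace_report` is hypothetical. It compares membership in the ASCII-only `string.whitespace` constant with `str.isspace()`, which consults Unicode character properties instead.

```python
import string

IDEOGRAPHIC_SPACE = "\u3000"
NO_BREAK_SPACE = "\u00a0"


def whitespace_report(ch):
    """Return (in string.whitespace, str.isspace()) for one character."""
    return (ch in string.whitespace, ch.isspace())
```

In Python 3, both U+3000 and U+00A0 are absent from `string.whitespace` but are reported as whitespace by `str.isspace()`, so a check built on the constant alone would miss them.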
