Which Locale should I specify when I call String#toLowerCase?


Question

In Java, the no-argument String#toLowerCase method uses the default system Locale to determine how to handle lowercasing. If I am lowercasing some ASCII text and want to be sure that it is processed as expected, which Locale should I use?

EDIT: I'm mainly concerned about programming identifiers such as table and column names in a schema. As such I want English lower casing to apply.

The documentation for Locale.ROOT states that it is the language/country-neutral locale for locale-sensitive operations.

Locale.ENGLISH would presumably also be a safe choice.

Solution

Yes, Locale.ENGLISH is a safe choice for case operations on things like programming language identifiers and URL parts. It does not involve any special casing rules, and under the ENGLISH locale every 7-bit ASCII character case-converts to another 7-bit ASCII character.

That is not true of every locale. In Turkish, for example, the 'I' and 'i' characters are not case-converted to one another.
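The difference can be seen in a minimal sketch like the one below. The class name IdentifierCase and the helper lowerIdentifier are hypothetical names chosen for illustration, and "TITLE" stands in for a column name. The results are compared against escaped strings rather than printed directly, so the output does not depend on the console encoding.

```java
import java.util.Locale;

class IdentifierCase {
    // Hypothetical helper: lowercases an identifier with an explicit locale,
    // so the result does not depend on the JVM's default locale.
    static String lowerIdentifier(String name) {
        return name.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String column = "TITLE";

        // ASCII in, ASCII out under Locale.ENGLISH.
        System.out.println(lowerIdentifier(column).equals("title"));            // true

        // Under Turkish rules, 'I' lowercases to dotless i (U+0131).
        System.out.println(column.toLowerCase(turkish).equals("t\u0131tle"));   // true

        // Under Turkish rules, 'i' uppercases to dotted capital I (U+0130).
        System.out.println("title".toUpperCase(turkish).equals("T\u0130TLE"));  // true
    }
}
```

Passing the locale explicitly in one small helper keeps identifier handling stable even when the application runs on a machine whose default locale is Turkish or Azeri.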

"Dotted and dotless I" explains:

The Turkish alphabet, which is a variant of the Latin alphabet, includes two distinct versions of the letter I, one dotted and the other dotless.

In Unicode, U+0131 is a lower case letter dotless i (ı). U+0130 (İ) is capital i with dot. ISO-8859-9 has them at positions 0xFD and 0xDD respectively. In normal typography, when lower case i is combined with other diacritics, the dot is generally removed before the diacritic is added; however, Unicode still lists the equivalent combining sequences as including the dotted i, since logically it is the normal dotted i character that is being modified.

Most Unicode software uppercases ı to I and lowercases İ to i, but, unless specifically set up for Turkish, it lowercases I to i and uppercases i to I. Thus uppercasing then lowercasing, or vice versa, changes the letters.
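The round-trip effect described in the quoted passage can be checked with a short sketch. RoundTripDemo and upperThenLower are hypothetical names used only for this illustration. The input is the dotless i (U+0131), written as an escape so the source file encoding does not matter.

```java
import java.util.Locale;

class RoundTripDemo {
    // Uppercases and then lowercases the input under a single locale.
    static String upperThenLower(String s, Locale locale) {
        return s.toUpperCase(locale).toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String dotlessI = "\u0131";

        // Locale-neutral rules: dotless i -> I -> i, so the letter changes.
        System.out.println(upperThenLower(dotlessI, Locale.ROOT).equals("i"));   // true

        // Turkish rules: dotless i -> I -> dotless i, so the letter survives.
        System.out.println(upperThenLower(dotlessI, turkish).equals(dotlessI));  // true
    }
}
```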

The list of special exceptions is maintained at http://unicode.org/Public/UNIDATA/SpecialCasing.txt

# ================================================================================

# Turkish and Azeri

# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

...
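Each rule line in the excerpt lists a code point, then its lowercase, titlecase, and uppercase mappings, then the condition (here a language tag). The first pair of rules therefore says that U+0130 lowercases to U+0069 in Turkish and Azeri. The sketch below, which uses the hypothetical names SpecialCasingCheck and codeUnits, prints the code units that Java produces so the behavior can be inspected for several locales. No expected output is shown for "en", since the point is to compare it against the tr and az results.

```java
import java.util.Locale;

class SpecialCasingCheck {
    // Formats each UTF-16 code unit as U+XXXX so the output
    // does not depend on the console encoding.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format(Locale.ROOT, "U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String capitalDottedI = "\u0130";
        for (String tag : new String[] {"tr", "az", "en"}) {
            Locale locale = Locale.forLanguageTag(tag);
            // For tr and az, the rules above give a single U+0069.
            System.out.println(tag + ": " + codeUnits(capitalDottedI.toLowerCase(locale)));
        }
    }
}
```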
