Which Locale should I specify when I call String#toLowerCase?

Question

In Java, the String#toLowerCase method uses the default system Locale to determine how to handle lowercasing. If I am lowercasing some ASCII text and want to be sure that it is processed as expected, which Locale should I use?

I'm mainly concerned about programming identifiers such as table and column names in a schema. As such I want English lower casing to apply.

The documentation for Locale.ROOT states that it is the language/country neutral locale for locale-sensitive operations.

Locale.ENGLISH would presumably also be a safe choice.

Recommended answer

Yes, Locale.ENGLISH is a safe choice for case operations on things like programming language identifiers and URL parts, since it doesn't involve any special casing rules and all 7-bit ASCII characters in the ENGLISH locale case-convert to 7-bit ASCII characters.
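As a minimal sketch of the point above, the snippet below lowercases an identifier with an explicit locale so the result does not depend on the JVM's default. The class name `IdentifierCase` and the helper `normalizeIdentifier` are hypothetical names chosen for illustration; Locale.ROOT is used here, and Locale.ENGLISH should behave the same way for ASCII input.

```java
import java.util.Locale;

class IdentifierCase {
    // Hypothetical helper: lowercase an ASCII identifier without
    // consulting the JVM's default locale.
    static String normalizeIdentifier(String identifier) {
        return identifier.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Both calls name their locale explicitly, so the default
        // system locale has no influence on the result.
        System.out.println(normalizeIdentifier("CUSTOMER_ID"));   // customer_id
        System.out.println("TITLE".toLowerCase(Locale.ENGLISH));  // title
    }
}
```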

That is not true for all other locales. In Turkish, the 'I' and 'i' characters are not case-converted to one another.
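To illustrate the Turkish case, here is a small sketch that applies the Turkish locale explicitly. The class and method names are assumptions made for this example; the non-ASCII letters are written as Unicode escapes so the source file's encoding does not matter.

```java
import java.util.Locale;

class TurkishCasing {
    // Turkish locale obtained from its BCP 47 language tag.
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    static String upperTurkish(String s) {
        return s.toUpperCase(TURKISH);
    }

    public static void main(String[] args) {
        // 'I' lowercases to dotless i (U+0131), not to ASCII 'i'.
        System.out.println(lowerTurkish("TITLE").equals("t\u0131tle")); // true
        // 'i' uppercases to capital I with dot above (U+0130), not to ASCII 'I'.
        System.out.println(upperTurkish("title").equals("T\u0130TLE")); // true
    }
}
```

This is the kind of result a table or column name could hit if the JVM's default locale happened to be Turkish and no locale was passed.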

"Dotted and dotless I" explains:

The Turkish alphabet, which is a variant of the Latin alphabet, includes two distinct versions of the letter I, one dotted and the other dotless.

In Unicode, U+0131 is a lower case letter dotless i (ı). U+0130 (İ) is capital i with dot. ISO-8859-9 has them at positions 0xFD and 0xDD respectively. In normal typography, when lower case i is combined with other diacritics, the dot is generally removed before the diacritic is added; however, Unicode still lists the equivalent combining sequences as including the dotted i, since logically it is the normal dotted i character that is being modified.

Most Unicode software uppercases ı to I and lowercases İ to i, but, unless specifically set up for Turkish, it lowercases I to i and uppercases i to I. Thus uppercasing then lowercasing, or vice versa, changes the letters.
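The round-trip effect described above can be sketched in Java with a non-Turkish locale. The class and method names below are hypothetical. The exact string produced when lowercasing U+0130 outside Turkish can vary between implementations, so the example only checks that the round trip does not return the original letter.

```java
import java.util.Locale;

class RoundTrip {
    // Uppercase, then lowercase, using the locale-neutral rules.
    static String upperThenLower(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Lowercase, then uppercase, using the locale-neutral rules.
    static String lowerThenUpper(String s) {
        return s.toLowerCase(Locale.ROOT).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Dotless i (U+0131) uppercases to 'I', which lowercases to ASCII 'i'.
        System.out.println(upperThenLower("\u0131").equals("i"));      // true
        // Capital I with dot above (U+0130) does not survive the round trip.
        System.out.println(lowerThenUpper("\u0130").equals("\u0130")); // false
    }
}
```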

The list of special exceptions is maintained at http://unicode.org/Public/UNIDATA/SpecialCasing.txt

# ================================================================================

# Turkish and Azeri

# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

...
