Why is capitalizing the first letter of a string so convoluted in Rust?

Question

I'd like to capitalize the first letter of a &str. It's a simple problem and I hope for a simple solution. Intuition tells me to do something like this:

let mut s = "foobar";
s[0] = s[0].to_uppercase();

But &strs can't be indexed like this. The only way I've been able to do it seems overly convoluted. I convert the &str to an iterator, convert the iterator to a vector, upper case the first item in the vector, which creates an iterator, which I index into, creating an Option, which I unwrap to give me the upper-cased first letter. Then I convert the vector into an iterator, which I convert into a String, which I convert to a &str.

let s1 = "foobar";
let mut v: Vec<char> = s1.chars().collect();
v[0] = v[0].to_uppercase().nth(0).unwrap();
let s2: String = v.into_iter().collect();
let s3 = &s2;

Is there an easier way than this, and if so, what? If not, why is Rust designed this way?

Recommended answer

Why is it so convoluted?

Let's break it down, line by line:

let s1 = "foobar";

We've created a literal string that is encoded in UTF-8. UTF-8 allows us to encode the 1,114,112 code points of Unicode in a manner that's pretty compact if you come from a region of the world that types in mostly characters found in ASCII, a standard created in 1963. UTF-8 is a variable length encoding, which means that a single code point might take from 1 to 4 bytes. The shorter encodings are reserved for ASCII, but many Kanji take 3 bytes in UTF-8.
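
To make the variable-length point concrete, here is a small sketch that reports how many bytes each character of a string occupies in UTF-8. The helper name utf8_widths and the sample string are my own, chosen only for illustration.

```rust
/// Pair every character of `s` with the number of bytes it needs in UTF-8.
/// (`utf8_widths` is a made-up helper name for this illustration.)
fn utf8_widths(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}

fn main() {
    // One ASCII letter, one Latin-1 letter, one Kanji, one astral-plane emoji.
    for (c, n) in utf8_widths("aß漢😀") {
        println!("{:?} takes {} byte(s)", c, n);
    }
}
```

The four characters take 1, 2, 3 and 4 bytes respectively, so the string is 10 bytes long even though it holds only 4 code points.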

let mut v: Vec<char> = s1.chars().collect();

This creates a vector of characters. A character is a 32-bit number that directly maps to a code point. If we started with ASCII-only text, we've quadrupled our memory requirements. If we had a bunch of characters from the astral plane, then maybe we haven't used that much more.
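
A rough way to see the memory difference is to compare the byte length of the UTF-8 string with the size of the elements a Vec<char> would hold. This is only a sketch: the two helper names are invented here, and it ignores the vector's own bookkeeping and any spare capacity.

```rust
use std::mem::size_of;

/// Bytes used by the UTF-8 encoding of `s`.
fn utf8_bytes(s: &str) -> usize {
    s.len()
}

/// Bytes used by the elements of a `Vec<char>` built from `s`
/// (element storage only; capacity and the Vec header are ignored).
fn char_vec_bytes(s: &str) -> usize {
    s.chars().count() * size_of::<char>()
}

fn main() {
    for s in ["foobar", "😀😀"].iter() {
        println!("{:?}: {} bytes as UTF-8, {} bytes as chars", s, utf8_bytes(s), char_vec_bytes(s));
    }
}
```

For "foobar" that is 6 bytes against 24, while two astral-plane characters come out at 8 bytes either way.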

v[0] = v[0].to_uppercase().nth(0).unwrap();

This grabs the first code point and requests that it be converted to an uppercase variant. Unfortunately for those of us who grew up speaking English, there's not always a simple one-to-one mapping of a "small letter" to a "big letter". Side note: we call them upper- and lower-case because one box of letters was above the other box of letters back in the day.

This code would panic if to_uppercase ever yielded nothing, but as far as I can tell that cannot happen: a code point with no corresponding uppercase variant is simply yielded unchanged, so the unwrap is safe. What does panic is v[0] when the string is empty. The code can also semantically fail when a code point has an uppercase variant that has multiple characters, such as the German ß, because nth(0) keeps only the first of them. Note that ß may never actually be capitalized in The Real World; this is just the example I can always remember and search for. As of 2017-06-29, in fact, the official rules of German spelling have been updated so that both "ẞ" and "SS" are valid capitalizations!
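
Here is a small sketch of what char::to_uppercase yields in a few cases. My understanding of the standard library documentation is that a character with no uppercase mapping is yielded unchanged, so the iterator always produces at least one char. The helper name upper_of is made up for this example.

```rust
/// Collect everything `char::to_uppercase` yields into a String.
/// (`upper_of` is a made-up helper name for this illustration.)
fn upper_of(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    println!("{}", upper_of('f')); // the simple one-to-one case
    println!("{}", upper_of('ß')); // one char in, two chars out
    println!("{}", upper_of('7')); // no uppercase variant: yielded unchanged
    // Keeping only the first yielded char silently drops the second 'S'.
    println!("{:?}", 'ß'.to_uppercase().nth(0));
}
```

The last line is the semantic failure in the code under discussion: it would turn a leading ß into a single S.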

let s2: String = v.into_iter().collect();

Here we convert the characters back into UTF-8 and require a new allocation to store them in, as the original variable was stored in constant memory so as to not take up memory at run time.

let s3 = &s2;

And now we take a reference to that String.

It's a simple problem

Unfortunately, this is not true. Perhaps we should endeavor to convert the world to Esperanto?

I presume char::to_uppercase already properly handles Unicode.

Yes, I certainly hope so. Unfortunately, Unicode isn't enough in all cases. Thanks to huon for pointing out the Turkish I, where both the upper (İ) and lower case (i) versions have a dot. That is, there is no one proper capitalization of the letter i; it depends on the locale of the source text as well.
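
As a sketch of the locale problem: char::to_uppercase takes no locale argument, so it has no way of knowing that the text is Turkish. The helper name upper_i is invented for this example.

```rust
/// Uppercase the letter 'i' with the locale-independent mapping in std.
/// (`upper_i` is a made-up helper name for this illustration.)
fn upper_i() -> String {
    'i'.to_uppercase().collect()
}

fn main() {
    let upper = upper_i();
    // Dotless capital I, which is what English expects...
    println!("{}", upper);
    // ...and not the dotted capital İ a Turkish reader would expect.
    println!("matches the Turkish capital: {}", upper == "İ");
}
```

Getting the Turkish result would need locale-aware case mapping from somewhere outside the standard library.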

why the need for all data type conversions?

Because the data types you are working with are important when you are worried about correctness and performance. A char is 32-bits and a string is UTF-8 encoded. They are different things.

indexing could return a multi-byte, Unicode character

There may be some mismatched terminology here. A char is a multi-byte Unicode character.

Slicing a string is possible if you go byte-by-byte, but the standard library will panic if you are not on a character boundary.
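
To poke at character boundaries without triggering the panic, the non-panicking str::get can be used in place of the slicing syntax, as in this sketch. The helper name try_prefix is my own.

```rust
/// Return the first `end` bytes of `s` as a string slice, or None if
/// `end` is out of range or falls in the middle of a character.
/// (`try_prefix` is a made-up helper name for this illustration.)
fn try_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..end)
}

fn main() {
    let s = "ßa"; // 'ß' occupies bytes 0..2, 'a' occupies byte 2
    println!("{:?}", try_prefix(s, 2)); // lands on a boundary
    println!("{:?}", try_prefix(s, 1)); // lands inside 'ß'
    println!("{}", s.is_char_boundary(1));
    // Writing &s[..1] here instead would panic at run time.
}
```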

One of the reasons that indexing a string to get a character was never implemented is because so many people misuse strings as arrays of ASCII characters. Indexing a string to set a character could never be efficient - you'd have to be able to replace 1-4 bytes with a value that is also 1-4 bytes, causing the rest of the string to bounce around quite a lot.
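
To illustrate why setting a character in place can't be a cheap indexed write, here is a sketch of a hypothetical set_first_char built on String::replace_range. Nothing like it exists in the standard library; the name and signature are mine.

```rust
/// Replace the first character of `s` with `new`, if there is one.
/// (`set_first_char` is a hypothetical helper, not a std API.)
fn set_first_char(s: &mut String, new: char) {
    if let Some(old) = s.chars().next() {
        let mut buf = [0u8; 4];
        // The old and new characters may have different encoded lengths,
        // in which case every byte after them has to be shifted.
        s.replace_range(..old.len_utf8(), new.encode_utf8(&mut buf));
    }
}

fn main() {
    let mut s = String::from("abc");
    println!("{} is {} bytes", s, s.len());
    set_first_char(&mut s, '漢'); // 1 byte replaced by 3 bytes
    println!("{} is {} bytes", s, s.len());
}
```

Swapping a one-byte 'a' for a three-byte '漢' grows the string from 3 to 5 bytes, and the trailing "bc" has to move to make room.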

to_uppercase could return an upper case character

As mentioned above, ß is a single character that, when capitalized, becomes two characters.

See also trentcl's answer which only uppercases ASCII characters.
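
For reference, an ASCII-only approach might look something like the sketch below. It is not necessarily the code from that answer, and the function name is my own. Because an ASCII letter and its uppercase form are both one byte, it can work in place on a &mut str.

```rust
/// Uppercase the first byte in place, but only if it is a complete
/// ASCII character. Anything else is left untouched.
/// (`ascii_uppercase_first` is a made-up name for this illustration.)
fn ascii_uppercase_first(s: &mut str) {
    // get_mut returns None for an empty string or when byte 1 is not a
    // character boundary, i.e. when the first character is multi-byte.
    if let Some(first) = s.get_mut(0..1) {
        first.make_ascii_uppercase();
    }
}

fn main() {
    let mut a = String::from("foobar");
    ascii_uppercase_first(&mut a);
    println!("{}", a);

    let mut b = String::from("ßa");
    ascii_uppercase_first(&mut b); // first character is not ASCII
    println!("{}", b);
}
```

The trade-off is that non-ASCII first letters are silently left alone.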

If I had to write the code, it'd look like:

fn some_kind_of_uppercase_first_letter(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        None => String::new(),
        Some(f) => f.to_uppercase().chain(c).collect(),
    }
}

fn main() {
    println!("{}", some_kind_of_uppercase_first_letter("joe"));
    println!("{}", some_kind_of_uppercase_first_letter("jill"));
    println!("{}", some_kind_of_uppercase_first_letter("von Hagen"));
    println!("{}", some_kind_of_uppercase_first_letter("ß"));
}

But I'd probably search for uppercase or unicode on crates.io and let someone smarter than me handle it.

Speaking of "someone smarter than me", Veedrac points out that it's probably more efficient to convert the iterator back into a slice after the first capital codepoints are accessed. This allows for a memcpy of the rest of the bytes.

fn some_kind_of_uppercase_first_letter(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        None => String::new(),
        Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
    }
}
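
As a quick sanity check, the two versions can be put side by side under different names (the names below are invented so both fit in one file) and run over the same inputs. This is just a sketch to confirm they agree; it says nothing about their relative speed.

```rust
/// The chain-based version, renamed for this comparison.
fn upper_first_chain(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        None => String::new(),
        Some(f) => f.to_uppercase().chain(c).collect(),
    }
}

/// The slice-based version, renamed for this comparison.
fn upper_first_concat(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        None => String::new(),
        // as_str() hands back the not-yet-consumed rest of the string.
        Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
    }
}

fn main() {
    for s in ["joe", "von Hagen", "ß", ""].iter() {
        let (a, b) = (upper_first_chain(s), upper_first_concat(s));
        assert_eq!(a, b);
        println!("{:?} -> {:?}", s, a);
    }
}
```

Both give "Joe", "Von Hagen", "SS" and an empty string for these inputs.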
