Does Iterator::collect allocate the same amount of memory as String::with_capacity?

Question


In C++ when joining a bunch of strings (where each element's size is known roughly), it's common to pre-allocate memory to avoid multiple re-allocations and moves:

std::vector<std::string> words;
constexpr size_t APPROX_SIZE = 20;

std::string phrase;
phrase.reserve((words.size() + 5) * APPROX_SIZE);  // <-- avoid multiple allocations
for (const auto &w : words)
  phrase.append(w);

Similarly, I did this in Rust (this chunk needs the unicode-segmentation crate):

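use unicode_segmentation::UnicodeSegmentation; // provides `graphemes` on `str`

fn reverse(input: &str) -> String {
    // Reserve the full byte length up front so the loop never reallocates.
    let mut result = String::with_capacity(input.len());
    for gc in input.graphemes(true /*extended*/).rev() {
        result.push_str(gc);
    }
    result
}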

I was told that the idiomatic way of doing it is a single expression:

fn reverse(input: &str) -> String {
    input
        .graphemes(true /*extended*/)
        .rev()
        .collect::<Vec<&str>>()
        .concat()
}

While I really like it and want to use it, from a memory allocation point of view, would the former allocate fewer chunks than the latter?

I disassembled this with cargo rustc --release -- --emit asm -C "llvm-args=-x86-asm-syntax=intel" but it doesn't have source code interspersed, so I'm at a loss.

Answer

Your original code is fine and I do not recommend changing it.

The original version allocates once: inside String::with_capacity.

The second version allocates at least twice: first, it creates a Vec<&str> and grows it by pushing &strs onto it. Then, it counts the total size of all the &strs and creates a new String with the correct size. (The code for this is in the join_generic_copy method in str.rs.) This is bad for several reasons:

  1. It allocates unnecessarily, obviously.
  2. Grapheme clusters can be arbitrarily large, so the intermediate Vec can't be usefully sized in advance -- it just starts at size 1 and grows from there.
  3. For typical strings, it allocates far more space than is needed just to store the end result, because each &str takes 16 bytes on a 64-bit target (a pointer plus a length) while a UTF-8 grapheme cluster is typically much smaller than that.
  4. It wastes time iterating over the intermediate Vec to get the final size where you could just take it from the original &str.
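To make points 3 and 4 concrete, here is a rough sketch using only the standard library. It assumes a per-char split as a stand-in for graphemes(true), since real grapheme clusters may span several chars; the helper names char_slices and intermediate_bytes are made up for this illustration.

```rust
use std::mem::size_of;

/// Splits `input` into one `&str` per `char`. This is only a std-only
/// stand-in for `graphemes(true)`; real grapheme clusters can be longer.
fn char_slices(input: &str) -> Vec<&str> {
    input
        .char_indices()
        .map(|(i, c)| &input[i..i + c.len_utf8()])
        .collect()
}

/// Bytes taken by the elements of the intermediate `Vec<&str>`
/// (spare capacity of the `Vec` is not counted).
fn intermediate_bytes(input: &str) -> usize {
    char_slices(input).len() * size_of::<&str>()
}

fn main() {
    let input = "hello, world";
    let parts = char_slices(input);

    println!("final string:       {} bytes", input.len());
    println!("Vec<&str> elements: {} bytes", intermediate_bytes(input));

    // Joining has to walk the Vec once just to learn the total length,
    // which was already available as `input.len()`.
    let total: usize = parts.iter().map(|s| s.len()).sum();
    assert_eq!(total, input.len());
}
```

For mostly-ASCII input the temporary Vec ends up several times larger than the string it is used to build.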

On top of all this, I wouldn't even consider this version idiomatic, because it collects into a temporary Vec in order to iterate over it, instead of just collecting the original iterator, as you had in an earlier version of your answer. The following version fixes problem #3 and makes #4 irrelevant, but doesn't satisfactorily address #2:

input.graphemes(true).rev().collect()

collect uses FromIterator for String, which will try to use the lower bound of the size_hint from the Iterator implementation for Graphemes. However, as I mentioned earlier, extended grapheme clusters can be arbitrarily long, so the lower bound can't be any greater than 1. Worse, &strs may be empty, so FromIterator<&str> for String doesn't know anything about the size of the result in bytes. This code just creates an empty String and calls push_str on it repeatedly.
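The gap between item count and byte count can be seen with any iterator of &str. The sketch below assumes a plain Vec<&str> as the source rather than Graphemes, so its size_hint is exact where the one for Graphemes is not; hint_and_len is a hypothetical helper written for this example.

```rust
/// Returns the iterator's `size_hint` together with the byte length of the
/// `String` collected from it.
fn hint_and_len(pieces: Vec<&str>) -> ((usize, Option<usize>), usize) {
    let iter = pieces.into_iter();
    let hint = iter.size_hint(); // counts items, not bytes
    let joined: String = iter.collect();
    (hint, joined.len())
}

fn main() {
    // Both inputs have two items, but very different byte lengths.
    println!("{:?}", hint_and_len(vec!["", ""]));
    println!("{:?}", hint_and_len(vec!["a", "a considerably longer piece"]));
}
```

Even a perfectly accurate item count tells the String nothing about how many bytes to reserve.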

Which, to be clear, is not bad! String has a growth strategy that guarantees amortized O(1) insertion, so if you have mostly tiny strings that won't need to be reallocated often, or you don't believe the cost of allocation is a bottleneck, using collect::<String>() here may be justified if you find it more readable and easier to reason about.
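If you want a feel for how often the growing String reallocates, one rough approach is to count capacity changes while pushing. This is a sketch under stated assumptions: a capacity change is used as a proxy for a reallocation, the input is ASCII so it can be sliced one byte at a time, and the names ascii_pieces and count_capacity_changes are invented for the example.

```rust
/// Yields `input` one byte at a time as `&str`. Only valid for ASCII input;
/// slicing would panic on a multi-byte character.
fn ascii_pieces(input: &str) -> impl Iterator<Item = &str> + '_ {
    (0..input.len()).map(move |i| &input[i..i + 1])
}

/// Appends every piece to `out` and counts how often the capacity changed.
/// The first allocation of an empty `String` counts as a change too.
fn count_capacity_changes<'a>(
    mut out: String,
    pieces: impl Iterator<Item = &'a str>,
) -> (String, usize) {
    let mut changes = 0;
    let mut cap = out.capacity();
    for piece in pieces {
        out.push_str(piece);
        if out.capacity() != cap {
            cap = out.capacity();
            changes += 1;
        }
    }
    (out, changes)
}

fn main() {
    let input = "x".repeat(1000);

    let (_, grown) = count_capacity_changes(String::new(), ascii_pieces(&input));
    let (_, sized) =
        count_capacity_changes(String::with_capacity(input.len()), ascii_pieces(&input));

    println!("starting empty:     {} capacity changes", grown);
    println!("with_capacity(len): {} capacity changes", sized);
}
```

The exact count for the empty start depends on the standard library's growth policy, so treat the number as indicative only; the preallocated case never needs to grow.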

Let's go back to your original code.

let mut result = String::with_capacity(input.len());
for gc in input.graphemes(true).rev() {
    result.push_str(gc);
}

This is idiomatic. collect is also idiomatic, but all collect does is basically the above, with a less accurate initial capacity. Since collect doesn't do what you want, it's not unidiomatic to write the code yourself.

There is a slightly more concise, iterator-y version that still makes only one allocation. Use the extend method, which is part of Extend<&str> for String:

fn reverse(input: &str) -> String {
    let mut result = String::with_capacity(input.len());
    result.extend(input.graphemes(true).rev());
    result
}

I have a vague feeling that extend is nicer, but both of these are perfectly idiomatic ways of writing the same code. You should not rewrite it to use collect, unless you feel that expresses the intent better and you don't care about the extra allocation.
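If you would like to check the single-allocation claim for the extend form yourself, something along these lines should do. It is a std-only sketch that reverses by char instead of by grapheme cluster (so it is not a drop-in replacement for the function above), and reverse_chars is a hypothetical name.

```rust
/// Reverses `input` one `char` at a time (a stand-in for grapheme clusters)
/// and reports whether the capacity reserved up front was still in place
/// after `extend`, i.e. whether the buffer never had to grow.
fn reverse_chars(input: &str) -> (String, bool) {
    let mut result = String::with_capacity(input.len());
    let cap_before = result.capacity();

    // `Extend<&str> for String` appends each slice in turn.
    result.extend(
        input
            .char_indices()
            .rev()
            .map(|(i, c)| &input[i..i + c.len_utf8()]),
    );

    let kept = result.capacity() == cap_before;
    (result, kept)
}

fn main() {
    let (reversed, kept) = reverse_chars("hello, world");
    println!("{reversed} (capacity unchanged: {kept})");
}
```

Because the reversed string has exactly as many bytes as the input, the capacity reserved by with_capacity(input.len()) is always enough.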
