Efficient appending to a variable-length container of strings (Golang)

Question

I need to apply multiple regexes to each line of a big log file (several GB long), gather the non-empty matches, and put them all in an array (for serialization and sending over the network).

Slices are not much help if the answer to this question holds:

If the slice does not have sufficient capacity, append will need to allocate new memory and copy the old one over. For slices with <1024 elements, it will double the capacity, for slices with >1024 elements it will increase it by factor 1.25.
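The quoted growth policy can be observed directly. This is a small sketch (the exact factors vary by Go version — newer runtimes switch to a smoother growth curve at 256 elements rather than 1024 — so treat the printed numbers as illustrative):

```go
package main

import "fmt"

func main() {
	var s []int
	oldCap := cap(s)
	for i := 0; i < 2000; i++ {
		s = append(s, i)
		if cap(s) != oldCap {
			// Capacity changed: append reallocated and copied the backing array.
			fmt.Printf("len=%d cap=%d\n", len(s), cap(s))
			oldCap = cap(s)
		}
	}
}
```

Each printed line marks one reallocation-and-copy; the gaps between them widen as the slice grows.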

Since there can be literally hundreds of thousands of regex matches, I can't really predict the length/capacity of the slice. I can't make it too big "just in case" either, because that would waste memory (or would it? If the memory allocator is smart enough not to commit memory that is never written to, maybe I could use a huge slice capacity without much harm?).

So I'm thinking about the following alternative:

  1. Accumulate matches in a doubly-linked list (http://golang.org/pkg/container/list/)
  2. Count its length (will len() work?)
  3. Pre-allocate a slice of that capacity
  4. Copy the string pointers into the slice

Is there a less laborious way of achieving this goal in Go (an append with ~O(1) amortized complexity)?

(golang newbie here of course)

Answer

append()'s average (amortized) cost is already O(1) because it grows the array by a percentage each time. As the array gets larger, growing it gets costlier but proportionately rarer. A 10M-item slice will be 10x more expensive to grow than a 1M-item slice, but since the extra capacity we're allocating is proportional to the size, it'll also be 10x as many append(slice, item) calls until the next time it grows. The increasing cost and decreasing frequency of reallocations cancel out, leaving the average cost constant, i.e., O(1).
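You can check the "costlier but proportionately rarer" part empirically: for a million appends, the number of reallocations is only a few dozen, since each growth step covers exponentially more appends than the last. A minimal sketch:

```go
package main

import "fmt"

func main() {
	const n = 1_000_000
	var s []int
	reallocs := 0
	oldCap := cap(s)
	for i := 0; i < n; i++ {
		s = append(s, i)
		if cap(s) != oldCap {
			// cap changed, so the runtime allocated a bigger array and copied.
			reallocs++
			oldCap = cap(s)
		}
	}
	// A million appends trigger only a few dozen reallocations.
	fmt.Printf("%d appends, %d reallocations\n", n, reallocs)
}
```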

The same idea applies to other languages' dynamically-sized arrays, too: Microsoft's std::vector implementation apparently grows the array by 50% each time, for example. Amortized O(1) doesn't mean you pay nothing for allocations, only that you keep paying at the same average rate as your array gets bigger.

On my laptop, I could run a million slice = append(slice, someStaticString)s in 77ms. One reason it's quick, which siritinga noted below, is that "copying" the string to enlarge the array is really just copying the string header (a pointer/length pair), not copying the contents. 100,000 string headers is still under 2MB to copy, not a big deal compared to the other quantities of data you're working with.

container/list turned out ~3x slower for me in a microbenchmark; linked-list append is also constant time, of course, but I imagine append has a lower constant because it can usually just write to a couple words of memory and not allocate a list item, etc. The timing code won't work in the Playground, but you can copy it locally and run it to see for yourself: http://play.golang.org/p/uYyMScmOjX

Sometimes you can usefully pre-allocate space to avoid reallocations/copies (in this example, using make([]string, 0, 1000000) cuts the runtime from ~77ms to ~10ms), but often you just don't have enough information about the expected data size to eke out worthwhile gains, and you're better off leaving it to the built-in algorithm.
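Pre-allocation works because append only reallocates when capacity runs out; with enough capacity up front, the backing array never moves:

```go
package main

import "fmt"

func main() {
	const n = 1_000_000

	// Length 0, but capacity for a million headers up front.
	s := make([]string, 0, n)
	before := cap(s)

	for i := 0; i < n; i++ {
		s = append(s, "x")
	}

	// Capacity never changed, so append never reallocated or copied.
	fmt.Println(cap(s) == before) // true
}
```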

But you're asking a more specific question here about a grep-like application (and thanks for asking a detailed question with context). For that, the bottom-line recommendation is that if you're searching gigs of logs, it's probably best to avoid buffering the whole output in RAM at all.

You could write something to stream the results as a single function: logparser.Grep(in io.Reader, out io.Writer, patterns []regexp.Regexp); you could alternatively make out a chan []byte or func(match []byte) (err error) if you don't want the code that sends results to be too enmeshed with the grep code.

(On []byte vs. string: a []byte seems to do the job here and avoids []byte<=>string conversions when you do I/O, so I'd prefer that. I don't know what all you're doing, though, and if you need string it's fine.)

If you do keep the whole match list in RAM, be aware that keeping around a reference to part of a big string or byte slice keeps the whole source string/slice from being garbage collected. So if you go that route, then counterintuitively you may actually want to copy matches to avoid keeping all of the source log data in RAM.
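A small illustration of that retention pitfall and the copy that avoids it (for strings, strings.Clone in Go 1.18+ does the same job):

```go
package main

import "fmt"

func main() {
	// Pretend this is one huge log line held in memory.
	line := []byte("2015-01-02 ERROR something went wrong")

	// A subslice shares line's backing array, so as long as sub is
	// reachable, the GC must keep the WHOLE line alive.
	sub := line[11:16] // "ERROR"

	// Copying into a fresh allocation lets the original line be
	// collected once nothing else references it.
	owned := append([]byte(nil), sub...)

	// The copy is independent: mutating the source doesn't touch it.
	line[11] = 'X'
	fmt.Printf("%s %s\n", sub, owned) // XRROR ERROR
}
```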
