Parameter expansion slow for large data sets
Question
If I take the first 1,000 bytes from a file, Bash can replace some characters pretty quickly:
$ cut -b-1000 get_video_info
muted=0&status=ok&length_seconds=24&endscreen_module=http%3A%2F%2Fs.ytimg.com%2F
yts%2Fswfbin%2Fendscreen-vfl4_CAIR.swf&plid=AATWGZfL-Ysy64Mp&sendtmp=1&view_coun
t=3587&author=hye+jeong+Jeong&pltype=contentugc&threed_layout=1&storyboard_spec=
http%3A%2F%2Fi1.ytimg.com%2Fsb%2FLHelEIJVxiE%2Fstoryboard3_L%24L%2F%24N.jpg%7C48
%2327%23100%2310%2310%230%23default%23cTWfBXjxZMDvzL5cyCgHdDJ3s_A%7C80%2345%2324
%2310%2310%231000%23M%24M%23m1lhUvkKk6sTnuyKXnPBojTIqeM%7C160%2390%2324%235%235%
231000%23M%24M%23r-fWFZpjrP1oq2uq_Y_1im4iu2I%7C320%23180%2324%233%233%231000%23M
%24M%23uGg7bth0q6XSYb8odKLRqkNe7ao&approx_threed_layout=1&allow_embed=1&allow_ra
tings=1&url_encoded_fmt_stream_map=fallback_host%3Dtc.v11.cache2.c.youtube.com%2
6quality%3Dhd1080%26sig%3D610EACBDE06623717B1DC2265696B473C47BD28F.98097DEC78411
95A074D6D6EBFF8B277F9C071AE%26url%3Dhttp%253A%252F%252Fr9---sn-q4f7dney.c.youtub
e.com%252Fvideoplayback%253Fms%253Dau%2526ratebypass%253Dyes%2526ipbits%253D8%25
26key%253Dyt1%2526ip%253D99.109.97.214%2
$ read aa < <(cut -b-1000 get_video_info)
$ time set "${aa//%/\x}"
real 0m0.025s
user 0m0.031s
sys 0m0.000s
However, if I take 10,000 bytes, it slows down dramatically:
$ read aa < <(cut -b-10000 get_video_info)
$ time set "${aa//%/\x}"
real 0m8.125s
user 0m8.127s
sys 0m0.000s
I read Greg Wooledge's post, but it lacks an explanation as to why Bash parameter expansion is slow.
Answer
For the why, you can see the implementation of this code in pat_subst in subst.c in the bash source code.
For each match in the string, the length of the string is counted numerous times (in pat_subst, match_pattern and match_upattern), both as a C string and, more expensively, as a multibyte string. This makes the function both slower than necessary and, more importantly, quadratic in complexity.
This is why it's slow for larger input. (The original answer illustrates this with a graph of run time against input size, showing the quadratic growth; the image is not reproduced here.)
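The quadratic pattern can be mimicked in bash itself. The toy loop below is my own illustration, not bash's actual code: it re-scans the remaining string on every match, so n matches over an n-byte string add up to O(n^2) total work, just like recomputing the string length once per match does inside pat_subst.

```shell
# Toy model of the per-match rescan: each pattern match and each
# ${s#*%} strip walks the remaining string, which is O(n) per match.
naive_count_matches() {
  local s=$1 count=0
  while [[ $s == *%* ]]; do   # scanning for the next match: O(n)
    s=${s#*%}                 # stripping through it: another O(n)
    count=$((count + 1))
  done
  echo "$count"
}
naive_count_matches 'a%b%c%'   # → 3
```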
As for workarounds, just use sed. It's more likely to be optimized for string replacement operations (though you should be aware that POSIX only guarantees 8192 bytes per line, even though GNU sed handles arbitrarily large ones).