Parameter expansion slow for large data sets


Question

If I take the first 1,000 bytes from a file, Bash can replace some characters pretty quickly:

$ cut -b-1000 get_video_info
muted=0&status=ok&length_seconds=24&endscreen_module=http%3A%2F%2Fs.ytimg.com%2F
yts%2Fswfbin%2Fendscreen-vfl4_CAIR.swf&plid=AATWGZfL-Ysy64Mp&sendtmp=1&view_coun
t=3587&author=hye+jeong+Jeong&pltype=contentugc&threed_layout=1&storyboard_spec=
http%3A%2F%2Fi1.ytimg.com%2Fsb%2FLHelEIJVxiE%2Fstoryboard3_L%24L%2F%24N.jpg%7C48
%2327%23100%2310%2310%230%23default%23cTWfBXjxZMDvzL5cyCgHdDJ3s_A%7C80%2345%2324
%2310%2310%231000%23M%24M%23m1lhUvkKk6sTnuyKXnPBojTIqeM%7C160%2390%2324%235%235%
231000%23M%24M%23r-fWFZpjrP1oq2uq_Y_1im4iu2I%7C320%23180%2324%233%233%231000%23M
%24M%23uGg7bth0q6XSYb8odKLRqkNe7ao&approx_threed_layout=1&allow_embed=1&allow_ra
tings=1&url_encoded_fmt_stream_map=fallback_host%3Dtc.v11.cache2.c.youtube.com%2
6quality%3Dhd1080%26sig%3D610EACBDE06623717B1DC2265696B473C47BD28F.98097DEC78411
95A074D6D6EBFF8B277F9C071AE%26url%3Dhttp%253A%252F%252Fr9---sn-q4f7dney.c.youtub
e.com%252Fvideoplayback%253Fms%253Dau%2526ratebypass%253Dyes%2526ipbits%253D8%25
26key%253Dyt1%2526ip%253D99.109.97.214%2

$ read aa < <(cut -b-1000 get_video_info)

$ time set "${aa//%/\x}"

real    0m0.025s
user    0m0.031s
sys     0m0.000s

However, if I take 10,000 bytes, it slows dramatically:

$ read aa < <(cut -b-10000 get_video_info)

$ time set "${aa//%/\x}"

real    0m8.125s
user    0m8.127s
sys     0m0.000s

I read Greg Wooledge’s post, but it lacks an explanation as to why Bash parameter expansion is slow.

Answer

For the why, you can look at the implementation of pat_subst in subst.c in the Bash source code.

For each match in the string, the length of the string is counted numerous times (in pat_subst, match_pattern and match_upattern), both as a C string and, more expensively, as a multibyte string. This makes the function both slower than necessary and, more importantly, quadratic in complexity.

This is why it's slow for larger input. (The original answer included a graph here, plotting run time against input size to illustrate the quadratic growth; it is not reproduced in this copy.)
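If you want to reproduce the effect, a quick timing loop along these lines (a sketch assuming the same get_video_info file as above; absolute times are machine-dependent) should show the run time roughly quadrupling each time the input size doubles:

# Time the same substitution on doubling prefixes of the file.
for n in 1000 2000 4000 8000; do
    read -r aa < <(cut -b-"$n" get_video_info)
    echo "== $n bytes =="
    time set -- "${aa//%/\x}"
done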

As for workarounds, just use sed. It's more likely to be optimized for string replacement operations (though you should be aware that POSIX only guarantees 8192 bytes per line, even though GNU sed handles arbitrarily large ones).
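For example, the same replacement of % with the two characters \x can be done as a single pass with sed (a sketch based on the commands above):

$ time sed 's/%/\\x/g' get_video_info > /dev/null

or, to end up with the result in a variable as before:

$ read -r aa < <(cut -b-10000 get_video_info | sed 's/%/\\x/g')

Because sed streams the data once, the cost grows linearly with the input size rather than quadratically.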
