Why is Python faster than C when concatenating two strings?
Question
Currently I want to compare the speed of Python and C when they're used to do string stuff. I think C should give better performance than Python; however, I got a totally contrary result.
Here is the C program:
#include <unistd.h>
#include <sys/time.h>

#define L (100*1024)

char s[L+1024];
char c[2*L+1024];

double time_diff( struct timeval et, struct timeval st )
{
    return 1e-6*((et.tv_sec - st.tv_sec)*1000000 + (et.tv_usec - st.tv_usec ));
}

int foo()
{
    strcpy(c,s);
    strcat(c+L,s);
    return 0;
}

int main()
{
    struct timeval st;
    struct timeval et;
    int i;

    //printf("s:%x\nc:%x\n", s,c);
    //printf("s=%d c=%d\n", strlen(s), strlen(c));
    memset(s, '1', L);
    //printf("s=%d c=%d\n", strlen(s), strlen(c));
    foo();
    //printf("s=%d c=%d\n", strlen(s), strlen(c));
    //s[1024*100-1]=0;
    gettimeofday(&st,NULL);
    for( i = 0 ; i < 1000; i++ ) foo();
    gettimeofday(&et,NULL);
    printf("%f\n", time_diff(et,st));
    return 0;
}
and this is the Python one:
import time

s = '1'*102400

def foo():
    c = s + s
    #assert( len(c) == 204800 )

st = time.time()
for x in xrange(1000):
    foo()
et = time.time()
print (et-st)
and what I get:
root@xkqeacwf:~/lab/wfaster# python cp100k.py
0.027932882309
root@xkqeacwf:~/lab/wfaster# gcc cp100k.c
root@xkqeacwf:~/lab/wfaster# ./a.out
0.061820
Does that make sense? Or am I just making some stupid mistake?
Answer
Accumulated comments (mainly from me) converted into an answer:
- What happens if you use your knowledge of the lengths of the strings and use memmove() or memcpy() instead of strcpy() and strcat()? (I note that the strcat() could be replaced with strcpy() with no difference in the result; it might be interesting to check the timing.) Also, you didn't include <string.h> (or <stdio.h>), so you're missing any optimizations that <string.h> might provide!
Marcus: Yes, memmove() is faster than strcpy() and faster than Python, but why? Does memmove() do a word-width copy at a time?
- Yes; on a 64-bit machine, for nicely aligned data, it can be moving 64 bits at a time instead of 8 bits at a time; on a 32-bit machine, likely 32 bits at a time. It also has only one, simpler, test to make on each iteration (a count), not two ('count' and 'is this a null byte').
- The code for memmove() is highly optimized assembler, possibly inline (no function call overhead, though for 100 KiB of data, the function call overhead is minimal). The benefits are from the bigger moves and the simpler loop condition.
- I've not looked at the Python source, but it is practically a certainty that it keeps track of the length of its strings (they're null terminated, but Python always knows how long the active part of the string is). Knowing that length allows Python to use memmove() or memcpy() (the difference being that memmove() works correctly even if the source and destination overlap; memcpy() is not obliged to work correctly if they overlap). It is relatively unlikely that they've got anything faster than memmove/memcpy available.
Marcus: But memmove() still works well even after I make L=L-13, and sizeof(s) gives L+1024-13. My machine has sizeof(int)==4.
Marcus: So does Python use memmove() as well, or something magic?
I modified the C code to produce more stable timings for me on my machine (Mac OS X 10.7.4, 8 GiB 1333 MHz RAM, 2.3 GHz Intel Core i7, GCC 4.7.1), and to compare strcpy() and strcat() vs memcpy() vs memmove(). Note that I increased the loop count from 1000 to 10000 to improve the stability of the timings, and I repeat the whole test (of all three mechanisms) 10 times. Arguably, the timing loop count should be increased by another factor of 5-10 so that the timings are over a second.
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

#define L (100*1024)

char s[L+1024];
char c[2*L+1024];

static double time_diff( struct timeval et, struct timeval st )
{
    return 1e-6*((et.tv_sec - st.tv_sec)*1000000 + (et.tv_usec - st.tv_usec ));
}

static int foo(void)
{
    strcpy(c,s);
    strcat(c+L,s);
    return 0;
}

static int bar(void)
{
    memcpy(c + 0, s, L);
    memcpy(c + L, s, L);
    return 0;
}

static int baz(void)
{
    memmove(c + 0, s, L);
    memmove(c + L, s, L);
    return 0;
}

static void timer(void)
{
    struct timeval st;
    struct timeval et;
    int i;

    memset(s, '1', L);
    foo();

    gettimeofday(&st,NULL);
    for( i = 0 ; i < 10000; i++ )
        foo();
    gettimeofday(&et,NULL);
    printf("foo: %f\n", time_diff(et,st));

    gettimeofday(&st,NULL);
    for( i = 0 ; i < 10000; i++ )
        bar();
    gettimeofday(&et,NULL);
    printf("bar: %f\n", time_diff(et,st));

    gettimeofday(&st,NULL);
    for( i = 0 ; i < 10000; i++ )
        baz();
    gettimeofday(&et,NULL);
    printf("baz: %f\n", time_diff(et,st));
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        timer();
    return 0;
}
That compiled with no warnings under:
gcc -O3 -g -std=c99 -Wall -Wextra -Wmissing-prototypes -Wstrict-prototypes \
-Wold-style-definition cp100k.c -o cp100k
The timings I got were:
foo: 1.781506
bar: 0.155201
baz: 0.144501
foo: 1.276882
bar: 0.187883
baz: 0.191538
foo: 1.090962
bar: 0.179188
baz: 0.183671
foo: 1.898331
bar: 0.142374
baz: 0.140329
foo: 1.516326
bar: 0.146018
baz: 0.144458
foo: 1.245074
bar: 0.180004
baz: 0.181697
foo: 1.635782
bar: 0.136308
baz: 0.139375
foo: 1.542530
bar: 0.138344
baz: 0.136546
foo: 1.646373
bar: 0.185739
baz: 0.194672
foo: 1.284208
bar: 0.145161
baz: 0.205196
What is weird is that if I forego 'no warnings' and omit the <string.h> and <stdio.h> headers, as in the originally posted code, the timings I get are:
foo: 1.432378
bar: 0.123245
baz: 0.120716
foo: 1.149614
bar: 0.186661
baz: 0.204024
foo: 1.529690
bar: 0.104873
baz: 0.105964
foo: 1.356727
bar: 0.150993
baz: 0.135393
foo: 0.945457
bar: 0.173606
baz: 0.170719
foo: 1.768005
bar: 0.136830
baz: 0.124262
foo: 1.457069
bar: 0.130019
baz: 0.126566
foo: 1.084092
bar: 0.173160
baz: 0.189040
foo: 1.742892
bar: 0.120824
baz: 0.124772
foo: 1.465636
bar: 0.136625
baz: 0.139923
Eyeballing those results, the header-less code seems to be faster than the 'cleaner' code, though I've not run a Student's t-test on the two sets of data, and the timings have very substantial variability (but I do have things like Boinc running 8 processes in the background). The effect seemed to be more pronounced in the early versions of the code, when only strcpy() and strcat() were tested. I have no explanation for that, if it is a real effect!
Addendum by mvds:
Since the question was closed I cannot answer properly. On a Mac doing virtually nothing, I get these timings:
(with headers)
foo: 1.694667 bar: 0.300041 baz: 0.301693
foo: 1.696361 bar: 0.305267 baz: 0.298918
foo: 1.708898 bar: 0.299006 baz: 0.299327
foo: 1.696909 bar: 0.299919 baz: 0.300499
foo: 1.696582 bar: 0.300021 baz: 0.299775
(without headers, ignoring warnings)
foo: 1.185880 bar: 0.300287 baz: 0.300483
foo: 1.120522 bar: 0.299585 baz: 0.301144
foo: 1.122017 bar: 0.299476 baz: 0.299724
foo: 1.124904 bar: 0.301635 baz: 0.300230
foo: 1.120719 bar: 0.300118 baz: 0.299673
Preprocessor output (-E flag) shows that including the headers translates strcpy into builtin calls like:
((__builtin_object_size (c, 0) != (size_t) -1) ? __builtin___strcpy_chk (c, s, __builtin_object_size (c, 2 > 1)) : __inline_strcpy_chk (c, s));
((__builtin_object_size (c+(100*1024), 0) != (size_t) -1) ? __builtin___strcat_chk (c+(100*1024), s, __builtin_object_size (c+(100*1024), 2 > 1)) : __inline_strcat_chk (c+(100*1024), s));
So the libc version of strcpy outperforms the gcc builtin. (Using gdb it is easily verified that a breakpoint on strcpy indeed doesn't break on the strcpy() call if the headers are included.)
On Linux (Debian 5.0.9, amd64), the differences seem to be negligible. The generated assembly (-S flag) differs only in the debugging information carried by the includes.