Why is string comparison so fast in python?


I became curious to understand the internals of how string comparison works in python when I was solving the following example algorithm problem:

Given two strings, return the length of the longest common prefix

Solution 1: charByChar

My intuition told me that the optimal solution would be to start with one cursor at the beginning of both words and iterate forward until the prefixes no longer match. Something like

def charByChar(smaller, bigger):
  assert len(smaller) <= len(bigger)
  for p in range(len(smaller)):
    if smaller[p] != bigger[p]:
      return p
  return len(smaller)

To simplify the code, the function assumes that the length of the first string, smaller, is always smaller than or equal to the length of the second string, bigger.

Solution 2: binarySearch

Another method is to bisect the two strings to create two prefix substrings. If the prefixes are equal, we know that the common prefix point is at least as long as the midpoint. Otherwise, the common prefix is shorter than the midpoint. We can then recurse to find the prefix length.

Aka binary search.

def binarySearch(smaller, bigger):
  assert len(smaller) <= len(bigger)
  lo = 0
  hi = len(smaller)

  # binary search for prefix
  while lo < hi:
    # +1 for even lengths
    mid = ((hi - lo + 1) // 2) + lo

    if smaller[:mid] == bigger[:mid]:
      # prefixes equal
      lo = mid
    else:
      # prefixes not equal
      hi = mid - 1

  return lo
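
Both functions agree on small examples:

>>> charByChar("flow", "flowering")
4
>>> binarySearch("flow", "flowering")
4
>>> binarySearch("flow", "flight")
2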

At first I assumed that binarySearch would be slower, because string comparison would compare all characters several times rather than just the prefix characters as in charByChar.

Surprisingly, binarySearch turned out to be much faster after some preliminary benchmarking.
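
The exact benchmark harness isn't shown in the post; a minimal timeit sketch of the kind of measurement behind Figure A, with illustrative string construction, might look like:

import timeit

suffix = "y" * 50                        # constant suffix, as in Figure A
for prefixLen in (10**3, 10**4, 10**5, 10**6):
    prefix = "x" * prefixLen
    a = prefix + "a" + suffix            # equal-length strings that differ
    b = prefix + "b" + suffix            # right after the common prefix
    for fn in (charByChar, binarySearch):
        t = timeit.timeit(lambda: fn(a, b), number=10)
        print(fn.__name__, prefixLen, t / 10)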

Figure A

The figure above shows how performance is affected as the prefix length is increased. Suffix length remains constant at 50 characters.

This graph shows two things:

  1. As expected, both algorithms perform linearly worse as prefix length increases.
  2. Performance of charByChar degrades at a much faster rate.

Why is binarySearch so much better? I think it is because

  1. The string comparison in binarySearch is presumably optimized by the interpreter / CPU behind the scenes.
  2. charByChar actually creates new strings for each character accessed and this produces significant overhead.

To validate this I benchmarked the performance of comparing and slicing a string, labelled cmp and slice respectively below.

Figure B

This graph shows two important things:

  1. As expected, comparing and slicing increase linearly with length.
  2. The cost of comparing and slicing increases very slowly with length relative to the algorithm performance in Figure A. Note that both figures go up to strings of 1 billion characters. Therefore, the cost of comparing 1 character 1 billion times is much, much greater than the cost of comparing 1 billion characters once. But this still doesn't answer why ...

CPython

In an effort to find out how the cpython interpreter optimizes string comparison, I generated the bytecode for the following function.

In [9]: def slice_cmp(a, b): return a[0] == b[0]

In [10]: dis.dis(slice_cmp)
            0 LOAD_FAST                0 (a)
            2 LOAD_CONST               1 (0)
            4 BINARY_SUBSCR
            6 LOAD_FAST                1 (b)
            8 LOAD_CONST               1 (0)
           10 BINARY_SUBSCR
           12 COMPARE_OP               2 (==)
           14 RETURN_VALUE

I poked around the cpython code and found the following two pieces of code, but I'm not sure this is where string comparison occurs.
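
Note that slice_cmp above actually compares two length-1 strings; a true slice comparison compiles similarly (exact opcodes vary by CPython version, so run this yourself rather than trusting any listing):

import dis

def real_slice_cmp(a, b, n):
    return a[:n] == b[:n]

dis.dis(real_slice_cmp)
# each operand builds a slice and subscripts (BUILD_SLICE / BINARY_SUBSCR
# on CPython <= 3.10); a single COMPARE_OP (==) then does all the
# per-character work in C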

The question

  • Where in the cpython source does string comparison occur?
  • Is there a CPU optimization? Is there a special x86 instruction which does string comparison? How can I see what assembly instructions are generated by cpython? You may assume I am using python3 latest, Intel Core i5, OS X 10.11.6.
  • Why is comparing a long string so much faster than comparing each of its characters?


Bonus question: When is charByChar more performant?

If the prefix is sufficiently small in comparison to the length of the rest of the string, at some point the cost of creating substrings in charByChar becomes less than the cost of comparing the substrings in binarySearch.

To describe this relationship I delved into runtime analysis.

Runtime analysis

To simplify the below equations, let's assume that smaller and bigger are the same size and I will refer to them as s1 and s2.

charByChar

charByChar(s1, s2) = costOfOneChar * prefixLen

where

costOfOneChar = cmp(1) + slice(s1Len, 1) + slice(s2Len, 1)

Where cmp(1) is the cost of comparing two strings of length 1 char.

slice is the cost of accessing a char, the equivalent of charAt(i). Python has immutable strings and accessing a char actually creates a new string of length 1. slice(string_len, slice_len) is the cost of slicing a string of length string_len to a slice of size slice_len.

So

charByChar(s1, s2) = O((cmp(1) + slice(s1Len, 1)) * prefixLen)

binarySearch

binarySearch(s1, s2) = costOfHalfOfEachString * log_2(s1Len)

log_2 is the number of times to divide the strings in half until reaching a string of length 1, where

costOfHalfOfEachString = slice(s1Len, s1Len / 2) + slice(s2Len, s1Len / 2) + cmp(s1Len / 2)

So the big-O of binarySearch will grow according to

binarySearch(s1, s2) = O((slice(s2Len, s1Len) + cmp(s1Len)) * log_2(s1Len))

Based on our previous analysis of the cost of comparing and slicing (Figure B), if we assume that costOfHalfOfEachString is approximately the costOfComparingOneChar, then we can refer to them both as x.

charByChar(s1, s2) = O(x * prefixLen)
binarySearch(s1, s2) = O(x * log_2(s1Len))

If we equate them

O(charByChar(s1, s2)) = O(binarySearch(s1, s2))
x * prefixLen = x * log_2(s1Len)
prefixLen = log_2(s1Len)
2 ** prefixLen = s1Len

So O(charByChar(s1, s2)) > O(binarySearch(s1, s2)) when

2 ** prefixLen > s1Len

So, plugging in the above formula, I regenerated the tests for Figure A, but with strings of total length 2 ** prefixLen, expecting the performance of the two algorithms to be roughly equal.

However, charByChar clearly performs much better. With a bit of trial and error, the performance of the two algorithms is roughly equal when s1Len = 200 * prefixLen
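
A sketch of that trial and error, probing a few candidate ratios with the functions defined above (the particular lengths are arbitrary):

import timeit

prefixLen = 10**4
for ratio in (50, 100, 200, 400):
    s1Len = ratio * prefixLen
    a = "x" * prefixLen + "a" + "y" * (s1Len - prefixLen - 1)
    b = "x" * prefixLen + "b" + "y" * (s1Len - prefixLen - 1)
    t_char = timeit.timeit(lambda: charByChar(a, b), number=20)
    t_bin = timeit.timeit(lambda: binarySearch(a, b), number=20)
    print(ratio, round(t_char / t_bin, 2))   # runtime ratio; ~1.0 at break-even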

Why is the relationship 200x?

Solution

TL:DR: a slice compare is some Python overhead + a highly-optimized memcmp (unless there's UTF-8 processing?). Ideally, use slice compares to find the first mismatch to within less than 128 bytes or something, then loop a char at a time.

Or if it's an option and the problem is important, make a modified copy of an asm-optimized memcmp that returns the position of the first difference, instead of equal/not-equal; it will run as fast as a single == of the whole strings. Python has ways to call native C / asm functions in libraries.
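
For example, stock memcmp (which only reports -/0/+, not the mismatch position) is already reachable from pure Python via ctypes; a minimal sketch, assuming ctypes.util.find_library can locate your libc:

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.memcmp.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t)
libc.memcmp.restype = ctypes.c_int

def memcmp_eq(a: bytes, b: bytes) -> bool:
    # gives only -/0/+ like C memcmp; a position-returning variant would
    # need the modified asm copy described above
    return len(a) == len(b) and libc.memcmp(a, b, len(a)) == 0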

It's a frustrating limitation that the CPU can do this blazingly fast, but Python doesn't (AFAIK) give you access to an optimized compare loop that tells you the mismatch position instead of just equal / greater / less.


It's totally normal that interpreter overhead dominates the cost of the real work in a simple Python loop, with CPython. Building an algorithm out of optimized building blocks is worth it even if it means doing more total work. This is why NumPy is good, but looping over a matrix element-by-element is terrible. The speed difference might be something like a factor of 20 to 100, for CPython vs. a compiled C (asm) loop for comparing one byte at a time (made up numbers, but probably right to within an order of magnitude).
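
NumPy's building blocks can in fact recover the mismatch position for byte data; a sketch (it assumes the inputs are bytes / single-byte characters, and makes two O(n) passes, but both run at C/SIMD speed):

import numpy as np

def first_mismatch(a: bytes, b: bytes) -> int:
    n = min(len(a), len(b))
    va = np.frombuffer(a, dtype=np.uint8, count=n)
    vb = np.frombuffer(b, dtype=np.uint8, count=n)
    neq = va != vb                                 # vectorized byte compare
    return int(neq.argmax()) if neq.any() else n   # index of first True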

Comparing blocks of memory for equality is probably one of the biggest mismatches between Python loops vs. operating on a whole list / slice. It's a common problem with highly-optimized solutions (e.g. most libc implementations (including OS X) have a manually-vectorized hand-coded asm memcmp that uses SIMD to compare 16 or 32 bytes in parallel, and runs much faster than a byte-at-a-time loop in C or assembly). So there's another factor of 16 to 32 (if memory bandwidth isn't a bottleneck) multiplying the factor of 20 to 100 speed difference between Python and C loops. Or depending on how optimized your memcmp is, maybe "only" 6 or 8 bytes per cycle.

With data hot in L2 or L1d cache for medium-sized buffers, it's reasonable to expect 16 or 32 bytes per cycle for memcmp on a Haswell or later CPU. (i3/i5/i7 naming started with Nehalem; i5 alone is not sufficient to tell us much about your CPU.)

I'm not sure if either or both of your comparisons are having to process UTF-8 and check for equivalency classes or different ways to encode the same character. The worst case is if your Python char-at-a-time loop has to check for potentially-multi-byte characters but your slice compare can just use memcmp.


Writing an efficient version in Python:

We're just totally fighting against the language to get efficiency: your problem is almost exactly the same as the C standard library function memcmp, except you want the position of the first difference instead of a - / 0 / + result telling you which string is greater. The search loop is identical, it's just a difference in what the function does after finding the result.

Your binary search is not the best way to use a fast compare building block. A slice compare still has O(n) cost, not O(1), just with a much smaller constant factor. You can and should avoid re-comparing the starts of the buffers repeatedly by using slices to compare large chunks until you find a mismatch, then go back over that last chunk with a smaller chunk size.

def common_prefix_len(smaller, bigger):
    # chunked compare: use big slices to localize the mismatch quickly,
    # then finish char-at-a-time (charByChar from the question) once the
    # window is small
    assert len(smaller) <= len(bigger)
    chunksize = min(8192, len(smaller))
    # possibly round chunksize down to the next lowest power of 2?
    start = 0
    while start < len(smaller):
        end = min(start + chunksize, len(smaller))  # clamp the final chunk
        if smaller[start:end] == bigger[start:end]:
            start = end               # whole chunk matches: skip past it
        elif chunksize <= 128:
            # mismatch is inside this small window: finish char-at-a-time
            return start + charByChar(smaller[start:end], bigger[start:end])
        else:
            chunksize //= 8           # re-scan the same start with smaller chunks
    return len(smaller)               # smaller is a prefix of bigger
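
A quick randomized check of the fixed-up version against the question's charByChar (equal-length strings differing right after an n-char prefix, plus the prefix-of case):

import random

for _ in range(1000):
    n = random.randrange(0, 20000)
    tail = "y" * random.randrange(0, 300)
    a = "x" * n + "a" + tail
    b = "x" * n + "b" + tail
    assert common_prefix_len(a, b) == charByChar(a, b) == n
    assert common_prefix_len("x" * n, "x" * n + "z") == n   # smaller is a prefix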

I chose 8192 because your CPU has a 32kiB L1d cache, so the total cache footprint of two 8k slices is 16k, half your L1d. When the loop finds a mismatch, it will re-scan the last 8kiB in 1k chunks, and these compares will loop over data that's still hot in L1d. (Note that if == found a mismatch, it probably only touched data up to that point, not the whole 8k. But HW prefetch will keep going somewhat beyond that.)

A factor of 8 should be a good balance between using large slices to localize quickly vs. not needing many passes over the same data. This is a tunable parameter of course, along with chunk size. The bigger the mismatch between Python and asm, the smaller this factor should be to reduce Python loop iterations.

Hopefully 8k is big enough to hide the Python loop / slice overhead; hardware prefetching should still be working during the Python overhead between memcmp calls from the interpreter so we don't need the granularity to be huge. But for really big strings, if 8k doesn't saturate memory bandwidth then maybe make it 64k (your L2 cache is 256kiB; i5 does tell us that much).

How exactly is memcmp so fast:

I am running this on Intel Core i5 but I would imagine I would get the same results on most modern CPUs.

Even in C, memcmp is much faster than a byte-at-a-time compare loop (see Why is memcmp so much faster than a for loop check?), because even C compilers aren't great at (or are totally incapable of) auto-vectorizing search loops.

Even without hardware SIMD support, an optimized memcmp could check 4 or 8 bytes at a time (word size / register width) on a simple CPU.

But most modern CPUs, and all x86-64, have SIMD instructions. SSE2 is baseline for x86-64, and available as an extension in 32-bit mode.

An SSE2 or AVX2 memcmp can use pcmpeqb / pmovmskb to compare 16 or 32 bytes in parallel. (I'm not going to go into detail about how to write memcmp in x86 asm or with C intrinsics. Google that separately, and/or look up those asm instructions in an x86 instruction-set reference like http://felixcloutier.com/x86/index.html. See also the x86 tag wiki for asm and performance links; e.g. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? has some info about single-core memory bandwidth limitations.)

I found an old version from 2005 of Apple's x86-64 memcmp (in AT&T syntax assembly language) on their opensource web site. It could definitely be better; for large buffers it should align one source pointer and only use movdqu for the other one, allowing movdqu then pcmpeqb with a memory operand instead of 2x movdqu, even if the strings are misaligned relative to each other. xorl $0xFFFF,%eax / jnz is also not optimal on CPUs where cmp/jcc can macro fuse but xor / jcc can't.

Unrolling to check a whole 64-byte cache line at once would also hide loop overhead. (This is the same idea as using a large chunk and then looping back over it when you find a hit.) Glibc's AVX2-movbe version does this with vpand to combine compare results in the main large-buffer loop, with the final combine being a vptest that also sets flags from the result. (Smaller code-size but no fewer uops than vpand/vpmovmskb/cmp/jcc; but no downside, and maybe lower latency to reduce branch-mispredict penalties on loop exit.) Glibc does dynamic CPU dispatching at dynamic link time; it picks this version on CPUs that support it.

Hopefully Apple's memcmp is better these days; I don't see source for it at all in the most recent Libc directory, though. Hopefully they dispatch at runtime to an AVX2 version for Haswell and later CPUs.

The LLoopOverChunks loop in the version I linked would only run at 1 iteration (16 bytes from each input) per ~2.5 cycles on Haswell; 10 fused-domain uops. But that's still much faster than the 1 byte per cycle of a naive C loop, and a Python loop is much, much worse than that.

Glibc's L(loop_4x_vec): loop is 18 fused-domain uops, and can thus run at just slightly less than 32 bytes (from each input) per clock cycle, when data is hot in L1d cache. Otherwise it will bottleneck on L2 bandwidth. It could have been 17 uops if they hadn't used an extra instruction inside the loop decrementing a separate loop counter, and calculated an end-pointer outside the loop.


Finding instructions / hot spots in the Python interpreter's own code

How could I drill down to find the C instructions and CPU instructions that my code calls?

On Linux you could run perf record python ... then perf report -Mintel to see which functions the CPU was spending the most time in, and which instructions in those functions were the hottest. You'll get results something like I posted here: Why is float() faster than int()?. (Drill down into any function to see the actual machine instructions that ran, shown as assembly language because perf has a disassembler built in.)

For a more nuanced view that samples the call-graph on each event, see linux perf: how to interpret and find hotspots.

(When you're looking to actually optimize a program, you want to know which function calls are expensive so you can try to avoid them in the first place. Profiling for just "self" time will find hot spots, but you won't always know which different callers caused a given loop to run most of the iterations. See Mike Dunlavey's answer on that perf question.)

But for this specific case, profiling the interpreter running a slice-compare version over big strings should hopefully find the memcmp loop where I think it will be spending most of its time. (Or for the char-at-a-time version, find the interpreter code that's "hot".)

Then you can directly see what asm instructions are in the loop. From the function names, assuming your binary has any symbols, you can probably find the source. Or if you have a version of Python built with debug info, you can get to the source directly from profile info. (Not a debug build with optimization disabled, just one with full symbols.)
