Cythonize list of all splits of a string

Question

I'm trying to speed up a piece of code that generates all possible splits of a string.

splits('foo') -> [('f', 'oo'), ('fo', 'o'), ('foo', '')]

The code for this in python is very simple:

def splits(text):
    return [(text[:i + 1], text[i + 1:])
            for i in range(len(text))]

Is there a way to speed this up via cython or some other means? For context, the greater purpose of this code is to find the split of a string with the highest probability.

Recommended answer

This isn't the sort of problem that Cython tends to help with much. It's using slicing, which ends up largely the same speed as pure Python (i.e. actually pretty good).

Using a 100 character long byte string (b'0'*100) and 10000 iterations in timeit I get:


  • Your code as written - 0.37s
  • Your code as written but compiled in Cython - 0.21s
  • Your code with the line cdef int i and compiled in Cython - 0.20s (this is reproducibly a small improvement, and it is more significant with longer strings)
  • Your code with cdef int i and the parameter typed as bytes text - 0.28s (i.e. worse).
  • The best speed comes from using the Python C API directly (see code below) - 0.11s. I've chosen to do this mostly in Cython (but calling the API functions myself) for convenience, but you could write very similar code directly in C with a little more manual error checking. I've written this for the Python 3 API, assuming you're using bytes objects (i.e. PyBytes instead of PyString), so if you're using Python 2, or Unicode strings in Python 3, you'll have to change it a little.

from cpython cimport *
cdef extern from "Python.h":
    # This isn't included in the cpython definitions
    # using PyObject* rather than object lets us control refcounting
    PyObject* Py_BuildValue(const char*,...) except NULL

def split(text):
   cdef Py_ssize_t l,i
   cdef char* s

   # Cython automatically checks the return value and raises an error if 
   # these fail. This provides a type-check on text
   PyBytes_AsStringAndSize(text,&s,&l)
   output = PyList_New(l)

   for i in range(l):
       # PyList_SET_ITEM steals a reference
       # the casting is necessary to ensure that Cython doesn't
       # decref the result of Py_BuildValue
       PyList_SET_ITEM(output,i,
                       <object>Py_BuildValue('y#y#',s,i+1,s+i+1,l-(i+1)))
   return output


  • If you don't want to go all the way with using the C API then a version that preallocates the list output = [None]*len(text) and does a for-loop rather than a list comprehension is marginally more efficient than your original version - 0.18s
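The preallocated variant described in the last bullet could look roughly like the sketch below. The name splits_prealloc is made up for illustration, and no cdef declarations are included so that it also runs as ordinary Python; the timing quoted above was presumably taken from a Cython-compiled build, so this sketch alone will not reproduce it.

```python
def splits_prealloc(text):
    # Hypothetical name; works for both bytes and str inputs.
    n = len(text)
    # Allocate the result list once, then fill each slot in a plain loop
    # instead of building it with a list comprehension.
    output = [None] * n
    for i in range(n):
        output[i] = (text[:i + 1], text[i + 1:])
    return output
```

Calling splits_prealloc('foo') gives the same list as the original splits('foo').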

    In summary, just compiling it in Cython gives you a decent speed up (a bit less than 2x) and setting the type of i helps a little. This is all you can really achieve with Cython conventionally. To get full speed you basically need to resort to using the Python C API directly. That gets you a little under a 4x speed up which I think is pretty decent.
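For anyone who wants to repeat the measurements, a minimal timing harness along the lines described (a 100-byte string and 10000 iterations in timeit) might look like this. It times only the pure-Python version from the question; the compiled variants would need to be built and imported separately, and absolute numbers will differ by machine and Python version.

```python
import timeit

def splits(text):
    # The pure-Python version from the question.
    return [(text[:i + 1], text[i + 1:]) for i in range(len(text))]

text = b'0' * 100  # same test input as in the answer

# Total wall-clock time for 10000 calls, in seconds.
elapsed = timeit.timeit(lambda: splits(text), number=10000)
print(f"{elapsed:.2f}s for 10000 calls")
```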
