Cythonising Pandas: ctypes for content, index and columns

Question

I am very new to Cython, yet am already experiencing extraordinary speedups just by copying my .py to .pyx (and adding cimport cython, numpy etc.) and importing it into ipython3 with pyximport. Many tutorials start with this approach, the next step being to add cdef declarations for every data type, which I can do for the iterators in my for loops etc. But unlike most Pandas Cython tutorials or examples, I am not applying functions, so to speak; I am manipulating data using slices, sums and division (etc.).

So the question is: Can I increase the speed at which my code runs by stating that my DataFrame only contains floats (double), with columns that are int and rows that are int?

How do I define the type of an embedded list, i.e. [[int, int], [int]]?

Here is an example that generates the AIC score for a partitioning of a DF, sorry it is so verbose:

    cimport cython
    import numpy as np
    cimport numpy as np
    import pandas as pd

    offcat = [
        "breakingPeace", 
        "damage", 
        "deception", 
        "kill", 
        "miscellaneous", 
        "royalOffences", 
        "sexual", 
        "theft", 
        "violentTheft"
        ]

    def partitionAIC(EmpFrame, part, OffenceEstimateFrame, ReturnDeathEstimate=False):
        """EmpFrame is a DataFrame of ints, part is a nested list of ints,
        OffenceEstimateFrame is a DataFrame of floats.
        partOf/block is a list of ints.
        ll and AIC are Series/frames of floats."""
        ##Cython cdefs
        cdef int DFlen
        cdef int puns
        cdef int DeathPun    
        cdef int k
        cdef int pId
        cdef int punish

        DFlen = EmpFrame.shape[1]
        puns = 2
        DeathPun = 0
        PartitionModel = pd.DataFrame(index = EmpFrame.index, columns = EmpFrame.columns)

        for partOf in part:
            Grouping = [puns*x + y for x in partOf for y in list(range(0,puns))]
            PartGroupSum = EmpFrame.iloc[:,Grouping].sum(axis=1)

            for punish in range(0,puns):
                PunishGroup = [x*puns+punish for x in partOf]
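                # 1.0/puns: with puns declared as a C int, 1/puns can truncate to 0 under Python 2 division semantics
                punishPunishment = ((EmpFrame.iloc[:,PunishGroup].sum(axis = 1) + 1.0/puns).div(PartGroupSum+1)).values[np.newaxis].T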
                PartitionModel.iloc[:,PunishGroup] = punishPunishment
        PartitionModel = PartitionModel*OffenceEstimateFrame

        if ReturnDeathEstimate:
            DeathProbFrame = pd.DataFrame([[part]], index=EmpFrame.index, columns=['Partition'])
            for pId,block in enumerate(part):
                DeathProbFrame[pId] = PartitionModel.iloc[:,block[::puns]].sum(axis=1)
            DeathProbFrame = DeathProbFrame.apply(lambda row: sorted( [ [format("%6.5f"%row[idx])]+[offcat[X] for X in  x ] 
                for idx,x in enumerate(row['Partition'])],
                key=lambda x: x[0], reverse=True),axis=1)
        # convert_objects has been removed from pandas; astype(float) performs the numeric conversion
        ll = (EmpFrame*np.log(PartitionModel.astype(float))).sum(axis=1)
        k = (len(part))*(puns-1)
        AIC = 2*k-2*ll

        if ReturnDeathEstimate:
            return AIC, DeathProbFrame
        else:
            return AIC

Recommended answer

My advice is to do as much as possible in pandas. This is kinda standard advice: "get it working first, then care about performance if it really matters". So let's suppose you've done that (hopefully you've written some tests too), and it's too slow:

Profile your code. (See this SO answer, or use %prun in ipython).
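
Outside of ipython, roughly the same information as %prun can be collected with the standard library's cProfile and pstats modules. The following is only a minimal sketch: `slow_row_sums` and the toy frame are hypothetical stand-ins for your own function and data, not anything from the question.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def slow_row_sums(frame):
    # Hypothetical, deliberately row-wise function to give the profiler something to find.
    return frame.apply(lambda row: row.sum(), axis=1)


# Toy data (an assumption for illustration): 500 rows, 4 float columns.
frame = pd.DataFrame(np.arange(2000, dtype=np.float64).reshape(500, 4))

profiler = cProfile.Profile()
profiler.enable()
result = slow_row_sums(frame)
profiler.disable()

# Sort by cumulative time and keep the ten most expensive entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The lines near the top of the report are the ones worth looking at first; anything further down is unlikely to repay the effort of rewriting it.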

The output of prun should drive what to improve next:

  1. pandas (make your code more pandorable, this can help a lot).
  2. numpy (not creating intermediary Series/DataFrames, being careful about dtypes)
  3. cython (the last resort).
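
As an illustration of step 2, here is a rough sketch of how the inner loops of the question's partitionAIC might look when they operate on a plain float64 array instead of building a Series per slice. The name `partition_model_numpy` and the toy data are hypothetical, and the sketch assumes the columns are laid out as `puns` consecutive columns per offence, as the question's index arithmetic suggests.

```python
import numpy as np
import pandas as pd


def partition_model_numpy(emp, part, puns=2):
    """emp: 2-D float64 array of counts; part: nested list of offence ids."""
    # Start from NaN so any column not covered by `part` is easy to spot.
    model = np.full(emp.shape, np.nan, dtype=np.float64)
    for part_of in part:
        ids = np.asarray(part_of, dtype=np.intp)
        # All columns belonging to this block (every punishment of every offence in it).
        grouping = (puns * ids[:, None] + np.arange(puns)).ravel()
        group_sum = emp[:, grouping].sum(axis=1)
        for punish in range(puns):
            cols = puns * ids + punish
            share = (emp[:, cols].sum(axis=1) + 1.0 / puns) / (group_sum + 1.0)
            # Broadcast the per-row share across the block's columns for this punishment.
            model[:, cols] = share[:, None]
    return model


# Toy data (an assumption): 3 rows, 3 offences x 2 punishments.
emp_frame = pd.DataFrame(np.arange(18).reshape(3, 6))
emp = emp_frame.to_numpy(dtype=np.float64)  # explicit dtype, converted once
model = partition_model_numpy(emp, [[0, 1], [2]])
```

The DataFrame is converted once, up front, with an explicit dtype; everything inside the loops is then array slicing and arithmetic with no intermediate pandas objects.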

Now, if the slow line is to do with slicing (it probably isn't), put that tiny part in cython; I like to move single python function calls out into a cython function. On that point, code in cython should use numpy, not pandas: I don't think pandas objects are going to lower to C (cython can't infer their types).
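
One way to prepare data for such a numpy-only function is to convert everything to arrays with explicit dtypes at the boundary. The sketch below is an assumption about how that could be done, not something taken from the pandas docs: `to_typed_arrays` and `flatten_nested` are hypothetical helper names, and the flat-array-plus-offsets layout is just one common way to represent a nested list like [[int, int], [int]] with fixed-type buffers.

```python
import numpy as np
import pandas as pd


def to_typed_arrays(frame):
    """Return (values, index, columns) as contiguous arrays with fixed dtypes."""
    values = np.ascontiguousarray(frame.to_numpy(dtype=np.float64))
    index = frame.index.to_numpy(dtype=np.int64)
    columns = frame.columns.to_numpy(dtype=np.int64)
    return values, index, columns


def flatten_nested(part):
    """Turn a nested list of ints into a flat int64 array plus block offsets."""
    flat = np.fromiter((x for block in part for x in block), dtype=np.int64)
    offsets = np.zeros(len(part) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum([len(block) for block in part])
    return flat, offsets


# Toy data (an assumption): integer row and column labels, as in the question.
frame = pd.DataFrame([[1, 2], [3, 4]], index=[10, 11], columns=[0, 1])
values, index, columns = to_typed_arrays(frame)
flat, offsets = flatten_nested([[0, 1], [2]])

# Block i is recovered as flat[offsets[i]:offsets[i + 1]].
second_block = flat[offsets[1]:offsets[2]]
```

Arrays like these have a single known element type, which is what a typed cython signature can actually make use of; a DataFrame or a list of lists does not.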

Putting your entire code into cython won't actually help that much, you want to only put the specific lines, or function calls, which are performance sensitive. Keeping cython focussed is the only way to have a good time.

Read the enhancing performance section of the pandas docs*! Here this process (prun -> cythonize -> type) is gone over step-by-step with a real-life example.

*Full disclosure: I wrote that section of the docs! :)
