Adding a column to a dask dataframe, computing it through a rolling window


Problem Description

Suppose I have the following code, to generate a dummy dask dataframe:

import pandas as pd
import dask.dataframe as dd

pandas_dataframe = pd.DataFrame({'A': [0, 500, 1000], 'B': [-100, 200, 300], 'C': [0, 0, 1.0]})
test_data_frame = dd.from_pandas(pandas_dataframe, npartitions=1)

Ideally I would like to know the recommended way to add another column to the data frame, computing the column's content through a rolling window, in a lazy fashion.

I came up with the following approach:

import numpy as np
from dask import delayed


@delayed
def coupled_operation_example(dask_dataframe,
                              list_of_input_lbls,
                              fcn,
                              window_size,
                              init_value,
                              output_lbl):

    def preallocate_channel_data(vector_length, first_components):
        vector_out = np.zeros(vector_length)
        vector_out[0:len(first_components)] = first_components
        return vector_out

    def create_output_signal(relevant_data, fcn, window_size, initiated_vec):
        ## to be written; fcn would be a fcn accepting the sliding window
        pass

    initiated_vec = preallocate_channel_data(len(dask_dataframe), init_value)
    relevant_data = dask_dataframe[list_of_input_lbls]
    my_output_signal = create_output_signal(relevant_data, fcn, window_size, initiated_vec)

I was writing this, convinced that dask dataframes would allow me some slicing: they do not. So my first option would be to extract the columns involved in the computation as numpy arrays, but then they would be eagerly evaluated. I think the performance penalty would be significant. At the moment I create dask dataframes from h5 data, using h5py: so everything stays lazy until I write the output files.

Up to now I was processing data only row by row, so I had been using:

 test_data_frame.apply(fcn, axis=1, meta=float)

I do not think there is an equivalent functional approach for rolling windows; am I right? I would like something like Seq.windowed in F# or Haskell. Any suggestion is highly appreciated.

Recommended Answer

I have tried to solve it through a closure. I will post benchmarks on some data as soon as I have finalized the code. For now I have the following toy example, which seems to work, since dask dataframe's apply method seems to preserve the row order.

import numpy as np
import pandas as pd
import dask.dataframe as dd

number_of_components = 30

df = pd.DataFrame(np.random.randint(0, number_of_components, size=(number_of_components, 2)),
                  columns=list('AB'))
my_data_frame = dd.from_pandas(df, npartitions=1)


def sumPrevious(previousState):

    def getValue(row):
        # the closure keeps the previous row's value of column 'A'
        nonlocal previousState
        something = row['A'] - previousState
        previousState = row['A']
        return something

    return getValue


given_func = sumPrevious(1)
out = my_data_frame.apply(given_func, axis=1, meta=float)
df['computed'] = out.compute()

Now the bad news: I have tried to abstract it out, passing the state around and using a rolling window of any width, through this new function:

def generalised_coupled_computation(previous_state, coupled_computation, previous_state_update):

    def inner_function(actual_state):
        nonlocal previous_state
        actual_value = coupled_computation(actual_state, previous_state)
        previous_state = previous_state_update(actual_state, previous_state)
        return actual_value

    return inner_function

Suppose we initialize the function with:

init_state = df.loc[0]
coupled_computation = lambda act, prev: act['A'] - prev['A']
new_update = lambda act, prev: act
given_func3 = generalised_coupled_computation(init_state, coupled_computation, new_update)
out3 = my_data_frame.apply(given_func3, axis=1, meta=float)

Try to run it and be ready for surprises: the first element is wrong, possibly some pointer problem, given the odd result. Any insight?

Anyhow, if one passes primitive types, it seems to work.
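The likely culprit is aliasing rather than a pointer problem as such: `new_update` stores a reference to the mutable row object, and if that object is reused or overwritten when the next row is processed, the stored "previous" state changes underneath before it is read. Primitive types are immune because they are immutable. A minimal stand-alone illustration with a plain dict:

```python
import copy

row = {'A': 0}
previous = row             # alias: both names point to the same object
row['A'] = 500             # the "next row" overwrites the shared object
broken = previous['A']     # 500: the previous value is gone

row = {'A': 0}
previous = copy.copy(row)  # snapshot: an independent object
row['A'] = 500
fixed = previous['A']      # 0: the snapshot survives
```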

Update:

The solution is to use a copy:

import copy

def new_update(act, previous):
    return copy.copy(act)

Now the function behaves as expected; of course, if a more coupled logic is needed, the state-update and coupled-computation functions have to be adapted accordingly.

