How to apply a custom function to groups in a dask dataframe, using multiple columns as function input

Views: 51 | Posted: 2021/4/28 19:35:43 | Tags: python, pandas, dataframe, group-by, dask


Problem description

I have a very large dataframe that I'm handling with dask. By and large, it looks like this:

Col_1    Col_2   Bool_1   Bool_2
A        1       True     False
B        1       True     True
C        1       False    False
D        1       True     False
A        2       False    True
B        2       False    False
C        2       True     False
D        2       True     True

But it has millions of rows.

What I'm trying to do at this point in the code is calculate a Jaccard distance between Bool_1 and Bool_2 for each group formed by Col_2. The aim of the program is to produce one row for each group present in Col_2 (each row has several statistics; I'm reporting only the relevant columns).
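
For reference, here is a minimal sketch of the quantity being asked for, worked by hand on the Col_2 = 1 group of the sample table. It assumes the usual definitions: the Jaccard index is the size of the intersection divided by the size of the union, and the Jaccard distance is 1 minus the index. The helper name jaccard_index is hypothetical and not part of the original post.

```python
def jaccard_index(a, b):
    """Intersection over union of two equal-length Boolean sequences."""
    intersection = sum(x and y for x, y in zip(a, b))
    union = sum(x or y for x, y in zip(a, b))
    # Undefined when neither sequence contains a True value
    return intersection / union if union else float("nan")

# Rows A-D of the sample table where Col_2 == 1
bool_1 = [True, True, False, True]
bool_2 = [False, True, False, False]

index = jaccard_index(bool_1, bool_2)  # 1 shared True out of 3 in the union
distance = 1 - index
print(index, distance)
```

Note that under these definitions two columns with no intersection have an index of 0 and therefore a distance of 1.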

To do so, I first group the dataframe by Col_2 using df.groupby("Col_2"), but then I don't know how to proceed. Every attempt so far has thrown an error.

1: I tried to define a function compute_jacc_dist() and pass it to the groups via apply(compute_jacc_dist, axis=1), but it has issues with the args and kwargs (the axis especially; see https://github.com/dask/dask/issues/1572 , which I haven't been able to solve yet).

2: I tried to use from dask_distance import jaccard to compute the Jaccard distance between Bool_1 and Bool_2, but it produces weird results (each group returns J=1 even if there is NO intersection).

3: I tried to compute() the dataframe and iterate over the groups using:

for name, group in df.groupby("Col_2"):
    jacc = dask_distance.jaccard(group["Bool_1"], group["Bool_2"])

But this is painfully slow, because it triggers a computation and then works through such a huge dataframe group by group (i.e. I don't want to use it). For reference, a script using this approach has been running for two days, while I estimate that either of attempts #1 and #2, if set up properly, would return results in 1-2 hours.

Any suggestions on how I could handle this? My ideal solution would be to use df.groupby("Col_2").apply(compute_jacc_dist) in a proper way. Any help is much appreciated!

Recommended answer

After many hours of trying, here's how I did it. If you're reading this, you may also want to read "How to apply euclidean distance function to a groupby object in pandas dataframe?" and "Apply multiple functions to multiple groupby columns".

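import numpy as np
import pandas as pd

def my_function(x):
    # x is the pandas DataFrame holding the rows of one Col_2 group
    d = {}
    v1 = np.array(x["Bool_1"])
    v2 = np.array(x["Bool_2"])
    intersection = np.logical_and(v1, v2).sum()
    union = np.logical_or(v1, v2).sum()
    # Intersection over union; NaN if neither column has a True in this group
    d["Jaccard"] = float(intersection) / float(union) if union else float("nan")
    return pd.Series(d, index=["Jaccard"])

# meta names the output column and its dtype (the function returns float64);
# compute() triggers the actual computation
df = df.groupby("Col_2").apply(my_function, meta={"Jaccard": "float64"}).compute()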

Explanation

I create a function that computes the Jaccard score between the two Boolean columns of my dataframe. As written, it returns the intersection over union (the Jaccard index); the Jaccard distance is 1 minus that value. Within the function, I create a dictionary (d) which will contain the results of my computations.

A perk of having a dictionary is that I can add as many computations as I want, although here there is only one.

The function then returns a pd.Series built from the dictionary.

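
To see what returning a pd.Series per group produces, the following sketch runs the same idea in plain pandas on the sample table from the question, which is a quick way to check the function before scaling up to dask. The function name jaccard_per_group and the variable names are hypothetical; selecting the two Boolean columns before apply() is a choice made here, not something the original answer does.

```python
import numpy as np
import pandas as pd

# Sample table from the question
pdf = pd.DataFrame({
    "Col_1": list("ABCDABCD"),
    "Col_2": [1, 1, 1, 1, 2, 2, 2, 2],
    "Bool_1": [True, True, False, True, False, False, True, True],
    "Bool_2": [False, True, False, False, True, False, False, True],
})

def jaccard_per_group(group):
    # group is the sub-DataFrame for one value of Col_2
    v1 = group["Bool_1"].to_numpy()
    v2 = group["Bool_2"].to_numpy()
    intersection = np.logical_and(v1, v2).sum()
    union = np.logical_or(v1, v2).sum()
    value = intersection / union if union else np.nan
    return pd.Series({"Jaccard": value})

# Each returned Series becomes one row; its index becomes the columns
result = pdf.groupby("Col_2")[["Bool_1", "Bool_2"]].apply(jaccard_per_group)
print(result)
```

The result is a DataFrame indexed by Col_2 with a single Jaccard column, one row per group.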
The function is applied to the dataframe groups, which are based on Col_2. The meta data types are specified within apply(), and the whole expression ends with compute(), since this is a dask dataframe and a computation must be triggered to get the result.

The meta passed to apply() should have one entry per output column.
