How to use pandas.cut() (or equivalent) in dask efficiently?


Problem description

I am trying to bin and group a large dataset in Python. It is a list of measured electrons with the properties (positionX, positionY, energy, time). I need to group it along positionX and positionY, and bin it into energy classes.

So far I could do it with pandas, but I would like to run it in parallel. So, I am trying to use dask.

The groupby method works very well, but unfortunately, I run into difficulties when trying to bin the data in energy. I found a solution using pandas.cut(), but it requires calling compute() on the raw dataset (turning it essentially into non-parallel code). Is there an equivalent to pandas.cut() in dask, or is there another (elegant) way to achieve the same functionality?

import dask.dataframe
import pandas

# create a dask dataframe from the array of measured electrons
dd = dask.dataframe.from_array(mainArray, chunksize=100000, columns=('posX', 'posY', 'time', 'energy'))

# Set the bins to bin along energy
bins = range(0, 10000, 500)

# Create the cut in energy (using non-parallel pandas code...)
energyBinner = pandas.cut(dd['energy'], bins)

# Group the data according to posX, posY and energy
grouped = dd.compute().groupby([energyBinner, 'posX', 'posY'])

# Apply the count() method to the data:
numberOfEvents = grouped['time'].count()

Thank you very much!

Answer

You should be able to do dd['energy'].map_partitions(pd.cut, bins).
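Below is a minimal sketch of how that suggestion could slot into the original workflow. The synthetic mainArray, the energyBin column name, and the use of assign() before grouping are assumptions for illustration, not part of the original answer; since the bin edges are fixed, applying pd.cut to each partition independently yields consistent categories, and nothing is computed until the final compute() call.

import numpy as np
import pandas as pd
import dask.dataframe

# Hypothetical stand-in for the measured data: posX, posY, time, energy
n = 1_000_000
mainArray = np.column_stack([
    np.random.randint(0, 100, n),   # posX
    np.random.randint(0, 100, n),   # posY
    np.random.rand(n),              # time
    np.random.rand(n) * 10000,      # energy
])

ddf = dask.dataframe.from_array(mainArray, chunksize=100000,
                                columns=('posX', 'posY', 'time', 'energy'))

# Fixed bin edges, so cutting each partition independently is consistent
bins = range(0, 10000, 500)

# Apply pd.cut partition-wise instead of on the computed dataset
ddf = ddf.assign(energyBin=ddf['energy'].map_partitions(pd.cut, bins))

# Group by the binned energy together with posX, posY (still lazy)
numberOfEvents = ddf.groupby(['energyBin', 'posX', 'posY'])['time'].count()

# The parallel computation only runs here
result = numberOfEvents.compute()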

