How to map a column with dask


Question

I want to apply a mapping on a DataFrame column. With Pandas this is straightforward:

df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))

This writes the infos column based on the custom_map function, using the rows in numbers for the lambda statement.
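For context, a self-contained version of that pattern might look like the following; custom_map and hashmap are hypothetical stand-ins, since the question doesn't show their definitions:

import pandas as pd

hashmap = {1: "one", 2: "two"}  # hypothetical lookup table

def custom_map(nr, hashmap):
    # return the mapped value, or a default if nr is missing from the map
    return hashmap.get(nr, "unknown")

df2 = pd.DataFrame({"numbers": [1, 2, 3]})
df = pd.DataFrame(index=df2.index)
df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))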

With dask this isn't that simple. ddf is a dask DataFrame. map_partitions is the equivalent: it executes the mapping in parallel on each partition of the DataFrame.

The following does not work, because you don't define columns like that in dask:

ddf["infos"] = ddf2["numbers"].map_partitions(lambda nr: custom_map(nr, hashmap))

Does anyone know how I can use columns here? I don't understand their API documentation at all.

Answer

You can use the .map method, exactly as in Pandas:

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'x': [1, 2, 3]})

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: df.x.map(lambda x: x + 1)
Out[5]: 
0    2
1    3
2    4
Name: x, dtype: int64

In [6]: ddf.x.map(lambda x: x + 1).compute()
Out[6]: 
0    2
1    3
2    4
Name: x, dtype: int64
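Since dask's Series.map mirrors the pandas API, it should also accept a plain dict, which fits the question's hashmap use case directly (a sketch; the mapping values here are made up):

>>> ddf.x.map({1: 'one', 2: 'two', 3: 'three'}).compute()
0      one
1      two
2    three
Name: x, dtype: object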

Metadata

You may be asked to provide a meta= keyword. This lets dask.dataframe know the output name and type of your function. Copying the docstring from map_partitions here:

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and 
column names of the output. This metadata is necessary for many 
algorithms in dask dataframe to work. For ease of use, some 
alternative inputs are also available. Instead of a DataFrame, 
a dict of {name: dtype} or iterable of (name, dtype) can be 
provided. Instead of a series, a tuple of (name, dtype) can be 
used. If not provided, dask will try to infer the metadata. 
This may lead to unexpected results, so providing meta is  
recommended. 

For more information, see dask.dataframe.utils.make_meta.
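To illustrate the dict form mentioned above, a map_partitions call that returns a DataFrame could declare its output schema like this (a sketch using the ddf from the session above, with a made-up derived column y):

>>> ddf.map_partitions(lambda part: part.assign(y=part.x + 1),
...                    meta={'x': int, 'y': int})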

So in the example above, where my output will be a series with name 'x' and dtype int, I can do either of the following to be more explicit:

>>> ddf.x.map(lambda x: x + 1, meta=('x', int))

>>> ddf.x.map(lambda x: x + 1, meta=pd.Series([], dtype=int, name='x'))

This tells dask.dataframe what to expect from our function. If no meta is given, dask.dataframe will try running your function on a small piece of data, and will raise an error asking for help if that fails.
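Putting this together for the original question, the pandas one-liner should carry over almost verbatim. The dtype object below is a guess, since custom_map's return type isn't shown; meta names the series after its source column (the assignment then stores it as infos), and ddf and ddf2 are assumed to share the same divisions so the column assignment aligns:

>>> ddf["infos"] = ddf2["numbers"].map(lambda nr: custom_map(nr, hashmap),
...                                    meta=('numbers', object))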
