How to map a column with dask
Question
I want to apply a mapping on a DataFrame column. With Pandas this is straightforward:
df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))
This writes the infos column, based on the custom_map function, and uses the rows in numbers for the lambda statement.
With dask this isn't that simple. ddf is a dask DataFrame. map_partitions is the equivalent of executing the mapping in parallel on each part of the DataFrame.
The following does not work, because you don't define columns like that in dask:
ddf["infos"] = ddf2["numbers"].map_partitions(lambda nr: custom_map(nr, hashmap))
Does anyone know how I can use columns here? I don't understand their API documentation at all.
Answer
You can use the .map method, exactly as in Pandas:
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({'x': [1, 2, 3]})
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: df.x.map(lambda x: x + 1)
Out[5]:
0 2
1 3
2 4
Name: x, dtype: int64
In [6]: ddf.x.map(lambda x: x + 1).compute()
Out[6]:
0 2
1 3
2 4
Name: x, dtype: int64
Metadata
You may be asked to provide a meta= keyword. This lets dask.dataframe know the output name and type of your function. Copying the docstring from map_partitions here:
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and
column names of the output. This metadata is necessary for many
algorithms in dask dataframe to work. For ease of use, some
alternative inputs are also available. Instead of a DataFrame,
a dict of {name: dtype} or iterable of (name, dtype) can be
provided. Instead of a series, a tuple of (name, dtype) can be
used. If not provided, dask will try to infer the metadata.
This may lead to unexpected results, so providing meta is
recommended.
For more information, see dask.dataframe.utils.make_meta.
So in the example above, where my output will be a series with name 'x' and dtype int, I can do either of the following to be more explicit:
>>> ddf.x.map(lambda x: x + 1, meta=('x', int))
or
>>> ddf.x.map(lambda x: x + 1, meta=pd.Series([], dtype=int, name='x'))
This tells dask.dataframe what to expect from our function. If no meta is given, then dask.dataframe will try running your function on a little piece of data; if this fails, it will raise an error asking for help.