dask dataframe apply meta
Question
I want to do a frequency count on a single column of a dask dataframe. The code works, but I get a warning complaining that meta is not defined. If I try to define meta, I get the error AttributeError: 'DataFrame' object has no attribute 'name'. For this particular use case it doesn't look like I need to define meta, but I'd like to know how to do that for future reference.
Dummy dataframe and column frequencies
import pandas as pd
from dask import dataframe as dd
df = pd.DataFrame([['Sam', 'Alex', 'David', 'Sarah', 'Alice', 'Sam', 'Anna'],
                   ['Sam', 'David', 'David', 'Alice', 'Sam', 'Alice', 'Sam'],
                   [12, 10, 15, 23, 18, 20, 26]],
                  index=['Column A', 'Column B', 'Column C']).T
# from_pandas needs npartitions (or chunksize) in this dask version
dask_df = dd.from_pandas(df, npartitions=1)
In [39]: dask_df.head()
Out[39]:
  Column A Column B Column C
0      Sam      Sam       12
1     Alex    David       10
2    David    David       15
3    Sarah    Alice       23
4    Alice      Sam       18
(dask_df.groupby('Column B')
.apply(lambda group: len(group))
).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
warnings.warn(msg)
Out[60]:
Column B
Alice    2
David    2
Sam      3
dtype: int64
Trying to define meta produces AttributeError
(dask_df.groupby('Column B')
.apply(lambda d: len(d), meta={'Column B': 'int'})).compute()
Same with this
(dask_df.groupby('Column B')
.apply(lambda d: len(d), meta=pd.DataFrame({'Column B': 'int'}))).compute()
The same happens if I use the dtype int instead of "int", or for that matter 'f8' or np.float64, so it doesn't seem like the dtype is what is causing the problem.
The documentation on meta seems to imply that I should be doing exactly what I'm trying to do (http://dask.pydata.org/en/latest/dataframe-design.html#metadata).
What is meta, and how am I supposed to define it?
Using python 3.6, dask 0.14.3 and pandas 0.20.2
Recommended answer
meta is the prescription of the names/types of the output from the computation. This is required because apply() is flexible enough that it can produce just about anything from a dataframe. As you can see, if you don't provide a meta, then dask actually computes part of the data to see what the types should be. That is fine, but you should know it is happening.
You can avoid this pre-computation (which can be expensive) and be more explicit when you know what the output should look like, by providing a zero-row version of the output (dataframe or series), or just the types.
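As a rough illustration of what a "zero-row version of the output" can look like, the sketch below uses plain pandas only. The small `sample` frame and the names `sample_result`, `meta_from_sample` and `meta_by_hand` are made up for this example; it assumes the function being applied is `len`, as in the question.

```python
import pandas as pd

# Hypothetical small sample with the same columns as the question's data.
sample = pd.DataFrame({
    'Column A': ['Sam', 'Alex', 'David'],
    'Column B': ['Sam', 'David', 'David'],
    'Column C': [12, 10, 15],
})

# Run the function on the sample with plain pandas to see what comes back.
sample_result = sample.groupby('Column B').apply(len)

# Slicing off every row leaves an empty object that still carries the
# dtype and index name: a zero-row version of the output.
meta_from_sample = sample_result.iloc[:0]

# The same idea written out by hand, without running anything.
meta_by_hand = pd.Series(dtype='int64')

print(type(meta_from_sample).__name__, meta_from_sample.dtype, len(meta_from_sample))
```

Either empty object describes a series of integers, which is the kind of description meta is meant to carry.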
The output of your computation is actually a series, so a dataframe-style meta (a dict of column names to dtypes) does not match it. The following is the simplest that works
(dask_df.groupby('Column B')
.apply(len, meta=('int'))).compute()
but more accurate would be
(dask_df.groupby('Column B')
.apply(len, meta=pd.Series(dtype='int', name='Column B')))