dask数据框应用meta [英] dask dataframe apply meta

查看:198
本文介绍了dask数据框应用meta的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想对任务数据框的单个列进行频率计数。该代码有效,但是我得到了警告抱怨说,未定义。如果我尝试定义,则会出现错误 AttributeError:'DataFrame'对象没有属性'name'。对于此特定用例,看起来好像不需要定义 meta ,但我想知道如何做以供将来参考。

I'm wanting to do a frequency count on a single column of a dask dataframe. The code works, but I get an warning complaining that meta is not defined. If I try to define meta I get an error AttributeError: 'DataFrame' object has no attribute 'name'. For this particular use case it doesn't look like I need to define meta but I'd like to know how to do that for future reference.

虚拟数据帧和列频率

import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame([['Sam', 'Alex', 'David', 'Sarah', 'Alice', 'Sam', 'Anna'],
                   ['Sam', 'David', 'David', 'Alice', 'Sam', 'Alice', 'Sam'],
                   [12, 10, 15, 23, 18, 20, 26]],
                  index=['Column A', 'Column B', 'Column C']).T
dask_df = dd.from_pandas(df)







In [39]: dask_df.head()
Out[39]: 
  Column A Column B Column C
0      Sam      Sam       12
1     Alex    David       10
2    David    David       15
3    Sarah    Alice       23
4    Alice      Sam       18







(dask_df.groupby('Column B')
        .apply(lambda group: len(group))
       ).compute()

UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)
Out[60]: 
Column B
Alice    2
David    2
Sam      3
dtype: int64






尝试定义会产生 AttributeError

 (dask_df.groupby('Column B')
         .apply(lambda d: len(d), meta={'Column B': 'int'})).compute()

与此相同

 (dask_df.groupby('Column B')
         .apply(lambda d: len(d), meta=pd.DataFrame({'Column B': 'int'}))).compute()

如果我尝试使用 dtype int 而不是 int 'f8' np.float64 ,所以它似乎不是 dtype 引起问题。

same if I try having the dtype be int instead of "int" or for that matter 'f8' or np.float64 so it doesn't seem like it's the dtype that is causing the problem.

上的文档似乎暗示我应该完全按照自己的意愿做尝试做的事情( http://dask.pydata.org/en/latest /dataframe-design.html#metadata )。

The documentation on meta seems to imply that I should be doing exactly what I'm trying to do (http://dask.pydata.org/en/latest/dataframe-design.html#metadata).

什么是?以及我应该如何定义它?

What is meta? and how am I supposed to define it?

使用 python 3.6 dask 0.14.3 pandas 0.20.2

推荐答案

是计算结果的名称/类型的规定。这是必需的,因为 apply()具有足够的灵活性,可以从数据框中生成几乎所有内容。如您所见,如果您不提供,那么dask实际上会计算部分数据,以查看类型是什么-很好,但是您应该知道它正在发生。
通过提供输出的零行版本(数据帧或系列),或者仅提供输出的零行版本,就可以避免这种预计算(可能会很昂贵),并且在知道输出的外观时更加明确。

meta is the prescription of the names/types of the output from the computation. This is required because apply() is flexible enough that it can produce just about anything from a dataframe. As you can see, if you don't provide a meta, then dask actually computes part of the data, to see what the types should be - which is fine, but you should know it is happening. You can avoid this pre-computation (which can be expensive) and be more explicit when you know what the output should look like, by providing a zero-row version of the output (dataframe or series), or just the types.

计算的输出实际上是一个序列,因此以下是最简单的方法

The output of your computation is actually a series, so the following is the simplest that works

(dask_df.groupby('Column B')
     .apply(len, meta=('int'))).compute()

,但更准确的是

(dask_df.groupby('Column B')
     .apply(len, meta=pd.Series(dtype='int', name='Column B')))

这篇关于dask数据框应用meta的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆