Dask - How to concatenate Series into a DataFrame with apply?


Question

How do I return multiple values from a function applied on a Dask Series? I am trying to return a series from each iteration of dask.Series.apply and for the final result to be a dask.DataFrame.

The following code tells me that the meta is wrong. The all-pandas version however works. What's wrong here?

Update: I think that I am not specifying the meta/schema correctly. How do I do it correctly? Now it works when I drop the meta argument. However, it raises a warning. I would like to use dask "correctly".

import dask.dataframe as dd
import pandas as pd
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

def transformMyCol(x):
    #Minimal Example Function
    return(pd.Series(['Tom - ' + str(x),'Deskflip - ' + str(x / 8),'']))

#
## Pandas Version - Works as expected.
#
pandas_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])
pandas_df.target.apply(transformMyCol,1)
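# An illustrative check (assuming the setup above): because transformMyCol returns a
# pd.Series built from a plain list, the expanded DataFrame takes its column names
# from that Series' default integer index.
pandas_df.target.apply(transformMyCol, 1).columns.tolist()  # [0, 1, 2]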

#
## Dask Version (second attempt) - Raises a warning
#
df = dd.from_pandas(pandas_df, npartitions=10)

unpacked = df.target.apply(transformMyCol)
unpacked.head()

#
## Dask Version (first attempt) - Raises an exception 
#
df = dd.from_pandas(pandas_df, npartitions=10)

unpacked_dask_schema = {"name" : str, "action" : str, "comments" : str}

unpacked = df.target.apply(transformMyCol, meta=unpacked_dask_schema)
unpacked.head()

This is the error I get:

  File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
    raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata

I have also tried the following, and it also does not work:

meta_df = pd.DataFrame(dtype='str', columns=list(unpacked_dask_schema.keys()))

unpacked = df.target.apply(transformMyCol, meta=meta_df)
unpacked.head()

Same error:

  File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
    raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata

Answer

You're right, the problem is that you aren't specifying the meta correctly. As the error message says, the meta columns ("name", "action", "comments") do not match the columns in the computed data (0, 1, 2): transformMyCol returns a pd.Series built from a plain list, so the expanded DataFrame takes its column names from the Series' default integer index 0, 1, 2. You should either:

  1. Change the meta columns to 0, 1, 2:

   unpacked_dask_schema = dict.fromkeys(range(3), str)
   df.target.apply(transformMyCol, meta=unpacked_dask_schema)

  2. Change transformMyCol to use the named columns (a full sketch follows after this list):


    def transformMyCol(x):
        return pd.Series({
            'name': 'Tom - ' + str(x), 
            'action': 'Deskflip - ' + str(x / 8), 
            'comments': '',
        })
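
Putting the pieces together, here is a minimal end-to-end sketch of the second option (assuming the iris setup from the question), with the named meta matching what the function now returns:

    import dask.dataframe as dd
    import pandas as pd
    import numpy as np
    from sklearn import datasets

    iris = datasets.load_iris()
    pandas_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                             columns=iris['feature_names'] + ['target'])
    df = dd.from_pandas(pandas_df, npartitions=10)

    def transformMyCol(x):
        # The returned Series' index becomes the column names of the result,
        # so it must match the keys used in meta.
        return pd.Series({
            'name': 'Tom - ' + str(x),
            'action': 'Deskflip - ' + str(x / 8),
            'comments': '',
        })

    unpacked_dask_schema = {"name": str, "action": str, "comments": str}
    unpacked = df.target.apply(transformMyCol, meta=unpacked_dask_schema)
    unpacked.head()  # columns: name, action, comments

Either way, the key point is that the column names in meta and the index of the Series returned by transformMyCol have to agree.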
