Dask dataframe - split column into multiple rows based on delimiter


Problem Description


What is an efficient way of splitting a column into multiple rows using dask dataframe? For example, let's say I have a csv file which I read using dask to produce the following dask dataframe:

id var1 var2
1  A    Z,Y
2  B    X
3  C    W,U,V

And I want to convert it to:

id var1 var2
1  A    Z
1  A    Y
2  B    X
3  C    W
3  C    U
3  C    V


I have looked into the answers for "Split (explode) pandas dataframe string entry to separate rows" and "pandas: How do I split text in a column into multiple rows?".


I tried applying the answer given in https://stackoverflow.com/a/17116976/7275290 but dask does not appear to accept the expand keyword in str.split.
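The attempted call looked roughly like this (a sketch of what I tried; the exact failure depends on the Dask version):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'var1': ['A', 'B', 'C'], 'var2': ['Z,Y', 'X', 'W,U,V']})
ddf = dd.from_pandas(pdf, npartitions=2)

# The pandas-style call that did not carry over: dask's str.split
# did not honour expand=True here (newer versions also need n=).
ddf.var2.str.split(',', expand=True)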


I also tried applying the vectorized approach suggested in https://stackoverflow.com/a/40449726/7275290 but then found out that np.repeat isn't implemented in dask with integer arrays (https://github.com/dask/dask/issues/2946).
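For reference, that vectorized approach looks roughly like this in plain pandas (a sketch using this question's column names; note it hinges on np.repeat with an integer repeat array, which is exactly what dask was missing):

import numpy as np
import pandas as pd

df = pd.DataFrame({'var1': ['A', 'B', 'C'], 'var2': ['Z,Y', 'X', 'W,U,V']})

split = df['var2'].str.split(',')
# Repeat each var1 value once per split element, then flatten the lists.
out = pd.DataFrame({
    'var1': np.repeat(df['var1'].to_numpy(), split.str.len()),
    'var2': np.concatenate(split.to_numpy()),
})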


I tried out a few other methods in pandas, but they were really slow; they might be faster with dask, but I wanted to check first whether anyone has had success with a particular method. The dataset I'm working with has over 10 million rows and 10 columns (string data). After splitting into rows it will probably grow to around 50 million rows.


Thank you for looking into this! I appreciate it.

Recommended Answer


Dask allows you to use pandas directly for operations that are row-wise (like this one) or that can be applied one partition at a time. Remember that a Dask dataframe consists of a set of pandas dataframes.


For the Pandas case you would do this, based on the linked questions:

import pandas as pd

df = pd.DataFrame([["A", "Z,Y"], ["B", "X"], ["C", "W,U,V"]],
                  columns=['var1', 'var2'])

# Split var2 on the delimiter into columns, stack those columns into rows,
# drop the inner index level, and join back onto the remaining columns.
df.drop('var2', axis=1).join(
    df.var2.str.split(',', expand=True).stack()
      .reset_index(drop=True, level=1).rename('var2'))


So for Dask you can apply exactly the same method via map_partitions, because each row is independent of all others. This would look cleaner if the function passed were written out separately rather than as a lambda (a named-function sketch follows the block below):

import dask.dataframe as dd

# A Dask dataframe is just a collection of pandas dataframes (partitions).
d = dd.from_pandas(df, npartitions=2)

# Apply the same pandas logic to every partition independently.
d.map_partitions(
    lambda df: df.drop('var2', axis=1).join(
        df.var2.str.split(',', expand=True).stack()
          .reset_index(drop=True, level=1).rename('var2')))
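The named-function variant (logic unchanged; explode_var2 is just an illustrative name):

def explode_var2(pdf):
    # pdf is a plain pandas dataframe: one partition of the Dask frame.
    return pdf.drop('var2', axis=1).join(
        pdf.var2.str.split(',', expand=True).stack()
           .reset_index(drop=True, level=1).rename('var2'))

d.map_partitions(explode_var2)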


If you ran .compute() on this, you would get exactly the same result as in the Pandas case above. Most likely, though, you will not want to compute your massive dataframe in one go like that, but rather perform further processing on it.
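For instance (a hedged sketch; the aggregation and output path are illustrative, and to_parquet needs pyarrow or fastparquet installed):

exploded = d.map_partitions(explode_var2)

# Aggregate without materialising the whole frame on one machine.
counts = exploded.groupby('var2').size().compute()

# Or write the result back to disk partition by partition.
exploded.to_parquet('exploded_output')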

