Python Dask - vertical concatenation of 2 DataFrames

Question

I am trying to vertically concatenate two Dask DataFrames.

I have the following Dask DataFrame:

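import pandas as pd
import dask.dataframe as dd

# First row holds the column names, the remaining rows hold the data
d = [
    ['A', 'B', 'C', 'D', 'E', 'F'],
    [1, 4, 8, 1, 3, 5],
    [6, 6, 2, 2, 0, 0],
    [9, 4, 5, 0, 6, 35],
    [0, 1, 7, 10, 9, 4],
    [0, 7, 2, 6, 1, 2],
]
df = pd.DataFrame(d[1:], columns=d[0])
# Build a Dask DataFrame from the pandas one
ddf = dd.from_pandas(df, npartitions=5)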

Here is the data as a Pandas DataFrame:

          A         B      C      D      E      F
0         1         4      8      1      3      5
1         6         6      2      2      0      0
2         9         4      5      0      6     35
3         0         1      7     10      9      4
4         0         7      2      6      1      2

Here is the Dask DataFrame:

Dask DataFrame Structure:
                   A      B      C      D      E      F
npartitions=4                                          
0              int64  int64  int64  int64  int64  int64
1                ...    ...    ...    ...    ...    ...
2                ...    ...    ...    ...    ...    ...
3                ...    ...    ...    ...    ...    ...
4                ...    ...    ...    ...    ...    ...
Dask Name: from_pandas, 4 tasks

I am trying to concatenate 2 Dask DataFrames vertically:

ddf_i = ddf + 11.5
dd.concat([ddf,ddf_i],axis=0)

But I get this error:

Traceback (most recent call last):
      ...
      File "...", line 572, in concat
        raise ValueError('All inputs have known divisions which cannot '
    ValueError: All inputs have known divisions which cannot be concatenated
    in order. Specify interleave_partitions=True to ignore order

However, if I try:

dd.concat([ddf,ddf_i],axis=0,interleave_partitions=True)

then it appears to work. Is there a problem with setting this to True (in terms of performance, i.e. speed)? Or is there another way to vertically concatenate 2 Dask DataFrames?

Recommended answer

If you inspect the divisions of the dataframe, ddf.divisions, you will find, assuming one partition, that it holds the edges of the index: (0, 4). This is useful to Dask: when you perform an operation on the data, it knows which partitions cannot contain the required index values and can skip them. This is also why some Dask operations are much faster when the index is appropriate for the job.

When you concatenate, the second dataframe has the same index as the first, so their divisions overlap. Concatenation would work without interleaving if the index values of the two dataframes covered different, non-overlapping ranges.
