How to create a copy of a dataframe in pyspark?

Views: 98  Published: 2020/9/4 7:59:29  Tags: python apache-spark pyspark apache-spark-sql

Question

I have a dataframe from which I need to create a new dataframe with a small change in the schema, using the following operation.

>>> from pyspark.sql.types import LongType
>>> X = spark.createDataFrame([[1,2], [3,4]], ['a', 'b'])
>>> schema_new = X.schema.add('id_col', LongType(), False)
>>> _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)

The problem is that in the above operation, the schema of X gets changed in place. So when I print X.columns I get

>>> X.columns
['a', 'b', 'id_col']

but the values in X are still the same:

>>> X.show()
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+

To avoid changing the schema of X, I tried creating a copy of X in three ways: using the copy and deepcopy methods from the copy module, and simply using _X = X.

The copy methods failed and returned

RecursionError: maximum recursion depth exceeded

The assignment method does not work either:

>>> _X = X
>>> id(_X) == id(X)
True

Since their ids are the same, _X and X are the same object, so the assignment does not create a duplicate: any operation done on _X is reflected in X.

So my question is really twofold:

How can I change the schema out of place (that is, without making any changes to X)?

And, more importantly, how do I create a duplicate of a pyspark dataframe?

Note:

This question is a follow-up to this post.

Recommended answer

As explained in the answer to the other question, you can make a deepcopy of your initial schema. You can then modify that copy and use it to initialize the new DataFrame _X:

import pyspark.sql.functions as F
from pyspark.sql.types import LongType
import copy

X = spark.createDataFrame([[1,2], [3,4]], ['a', 'b'])
_schema = copy.deepcopy(X.schema)
_schema.add('id_col', LongType(), False) # modified inplace
_X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(_schema)

Now let's check:

print('Schema of X: ' + str(X.schema))
print('Schema of _X: ' + str(_X.schema))

Output:

Schema of X: StructType(List(StructField(a,LongType,true),StructField(b,LongType,true)))
Schema of _X: StructType(List(StructField(a,LongType,true),
                  StructField(b,LongType,true),StructField(id_col,LongType,false)))

Note that to "copy" a DataFrame you can simply use _X = X. The assignment only binds a second name to the same object, but that is enough here because DataFrames are immutable: whenever you add a new column with, for example, withColumn, the object is not altered in place and a new DataFrame is returned. Hope this helps!
