How to create a copy of a dataframe in pyspark?
Problem description
I have a dataframe from which I need to create a new dataframe with a small change in the schema by doing the following operation.
>>> from pyspark.sql.types import LongType
>>> X = spark.createDataFrame([[1,2], [3,4]], ['a', 'b'])
>>> schema_new = X.schema.add('id_col', LongType(), False)
>>> _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
The problem is that in the above operation, the schema of X gets changed in place. So when I print X.columns I get
>>> X.columns
['a', 'b', 'id_col']
but the values in X are still the same
>>> X.show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
To avoid changing the schema of X, I tried creating a copy of X in three ways:
- using the copy and deepcopy methods from the copy module
- simply using _X = X
The copy methods failed and returned a RecursionError: maximum recursion depth exceeded
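For reference, a minimal sketch of what the copy-module attempt looks like (my reconstruction; the failure is as reported above and may vary by PySpark version):
import copy
_X = copy.copy(X)      # reportedly raises RecursionError on a DataFrame
_X = copy.deepcopy(X)  # likewise fails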
The assignment method also doesn't work
>>> _X = X
>>> id(_X) == id(X)
True
Since their ids are the same, creating a duplicate dataframe this way doesn't really help, and operations done on _X are reflected in X.
So my question is really twofold:
- how do I change the schema out of place (that is, without making any changes to X)?
- and more importantly, how do I create a duplicate of a pyspark dataframe?
Note: This question is a follow-up to this post.
As explained in the answer to the other question, you could make a deepcopy of your initial schema. We can then modify that copy and use it to initialize the new DataFrame _X:
import pyspark.sql.functions as F
from pyspark.sql.types import LongType
import copy
X = spark.createDataFrame([[1,2], [3,4]], ['a', 'b'])
_schema = copy.deepcopy(X.schema)
_schema.add('id_col', LongType(), False)  # modified in place
_X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(_schema)
Now let's check:
print('Schema of X: ' + str(X.schema))
print('Schema of _X: ' + str(_X.schema))
Output:
Schema of X: StructType(List(StructField(a,LongType,true),StructField(b,LongType,true)))
Schema of _X: StructType(List(StructField(a,LongType,true),StructField(b,LongType,true),StructField(id_col,LongType,false)))
Note that to copy a DataFrame you can just use _X = X. Whenever you add a new column with e.g. withColumn, the object is not altered in place; a new copy is returned.
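As a quick illustration, here is a minimal sketch reusing X and the F import from above (the column name c is arbitrary):
Y = X.withColumn('c', F.lit(0))  # returns a new DataFrame; X itself is untouched
print(X.columns)  # ['a', 'b']
print(Y.columns)  # ['a', 'b', 'c']
And if you do want a DataFrame object distinct from X (with its own id), one simple idiom is _X = X.select('*'), which returns a new DataFrame over the same data.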
Hope this helps!