How to retrieve a column from a pyspark dataframe and insert it as a new column within an existing pyspark dataframe?
Problem description

The problem is: I've got a pyspark dataframe like this
df1:
+--------+
|index |
+--------+
| 121|
| 122|
| 123|
| 124|
| 125|
| 121|
| 121|
| 126|
| 127|
| 120|
| 121|
| 121|
| 121|
| 127|
| 129|
| 132|
| 122|
| 121|
| 121|
| 121|
+--------+
I want to retrieve the index column from df1 and insert it into the existing dataframe df2 (which has the same length).
df2:
+--------------------+--------------------+
| fact1| fact2|
+--------------------+--------------------+
| 2.4899928731985597|-0.19775025821959014|
| 1.029654847161142| 1.4878188087911541|
| -2.253992428312965| 0.29853121635739804|
| -0.8866000393025826| 0.4032596563578692|
|0.027618408969029146| 0.3218421798358574|
| -3.096711320314157|-0.35825821485752635|
| 3.1758221960731525| -2.0598630487806333|
| 7.401934592245097| -6.359158142708468|
| 1.9954990843859282| 1.9352531243666828|
| 8.728444492631189| -4.644796442599776|
| 3.21061543955211| -1.1472165049607643|
| -0.9619142291174212| -1.2487100946166108|
| 1.0681264788022142| 0.7901514935750167|
| -1.599476182182916| -1.171236788513644|
| 2.657843803002389| 1.456063339439953|
| -1.5683015324294765| -0.6126175010968302|
| -1.6735815834568026| -1.176721177528106|
| -1.4246852948658484| 0.745873761554541|
| 3.7043534046759716| 1.3993120926240652|
| 5.420426369792451| -2.149279759367474|
+--------------------+--------------------+
to get a new df2 with the 3 columns: index, fact1, fact2
Any ideas?
Thanks in advance.
Recommended answer
import pyspark.sql.functions as f
df1 = sc.parallelize([[121],[122],[123]]).toDF(["index"])
df2 = sc.parallelize([[2.4899928731985597,-0.19775025821959014],[1.029654847161142,1.4878188087911541],
[-2.253992428312965,0.29853121635739804]]).toDF(["fact1","fact2"])
# since there is no common column between these two dataframes,
# add a row_index column so that they can be joined on it
df1=df1.withColumn('row_index', f.monotonically_increasing_id())
df2=df2.withColumn('row_index', f.monotonically_increasing_id())
df2 = df2.join(df1, on=["row_index"]).sort("row_index").drop("row_index")
df2.show()
Don't forget to let us know if it solved your problem :)