Concatenate two dataframes pyspark
Question
I'm trying to concatenate two dataframes, which look like this:
df1:
+---+---+
| a| b|
+---+---+
| a| b|
| 1| 2|
+---+---+
only showing top 2 rows
df2:
+---+---+
| c| d|
+---+---+
| c| d|
| 7| 8|
+---+---+
only showing top 2 rows
They both have the same number of rows, and I would like to do something like this:
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| a| b| c| d|
| 1| 2| 7| 8|
+---+---+---+---+
I tried:
df1=df1.withColumn('c', df2.c).collect()
df1=df1.withColumn('d', df2.d).collect()
But without success; it gives me this error:
Traceback (most recent call last):
File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o2804.withColumn.
Is there a way to do this?
Thanks
Answer
Here is an example of @Suresh's proposal: add a row-number column to each dataframe and join on it. (`withColumn` can only reference columns of the same dataframe, which is why the attempt above fails.)
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows of each dataframe so they can be aligned by position.
df1 = sqlctx.createDataFrame([('a', 'b'), ('1', '2')], ['a', 'b']) \
    .withColumn("row_number", F.row_number().over(Window.partitionBy().orderBy("a")))
df2 = sqlctx.createDataFrame([('c', 'd'), ('7', '8')], ['c', 'd']) \
    .withColumn("row_number", F.row_number().over(Window.partitionBy().orderBy("c")))

# Join on the generated row number, keeping only the original columns.
df3 = df1.join(df2, df1.row_number == df2.row_number, 'inner') \
    .select(df1.a, df1.b, df2.c, df2.d)
df3.show()
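One caveat: ordering the window by an existing column ("a", "c") sorts the rows, so the pairing only matches the question's layout if the sort happens to agree with the original order. A minimal sketch of a variant that numbers rows in their existing order instead, applied to the raw df1 and df2 from the question (the names w, df1_n, df2_n are illustrative, not from the original answer):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# monotonically_increasing_id() grows with the dataframe's partition order,
# so numbering over it generally preserves the rows as they already stand.
w = Window.orderBy(F.monotonically_increasing_id())
df1_n = df1.withColumn("row_number", F.row_number().over(w))
df2_n = df2.withColumn("row_number", F.row_number().over(w))

# Joining on the shared column name yields a single row_number column,
# which can then be dropped from the result.
df1_n.join(df2_n, "row_number").drop("row_number").show()

Note that either way, a window without partitionBy pulls all rows into a single partition, which is fine for small dataframes like these but costly at scale.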