Dataframe transpose with pyspark in Apache Spark
Question
I have a dataframe df that has the following structure:
+-----+-----+-----+-------+
|    s|col_1|col_2|col_...|
+-----+-----+-----+-------+
|   f1|  0.0|  0.6|    ...|
|   f2|  0.6|  0.7|    ...|
|   f3|  0.5|  0.9|    ...|
|  ...|  ...|  ...|    ...|
+-----+-----+-----+-------+
And I want to calculate the transpose of this dataframe so that it looks like:
+-------+-----+-----+-------+------+
|      s|   f1|   f2|     f3|   ...|
+-------+-----+-----+-------+------+
|  col_1|  0.0|  0.6|    0.5|   ...|
|  col_2|  0.6|  0.7|    0.9|   ...|
|col_...|  ...|  ...|    ...|   ...|
+-------+-----+-----+-------+------+
I tried these two solutions, but both fail with an error saying that the dataframe does not have the method being used:
Method 1:
for x in df.columns:
    df = df.pivot(x)
Method 2:
df = sc.parallelize([ (k,) + tuple(v[0:]) for k,v in df.items()]).toDF()
How can I fix this?
Answer
If the data is small enough to be transposed (rather than pivoted with aggregation), you can just convert it to a Pandas DataFrame:
df = sc.parallelize([
    ("f1", 0.0, 0.6, 0.5),
    ("f2", 0.6, 0.7, 0.9)]).toDF(["s", "col_1", "col_2", "col_3"])

# Collect to the driver and transpose locally with Pandas
df.toPandas().set_index("s").transpose()
s f1 f2
col_1 0.0 0.6
col_2 0.6 0.7
col_3 0.5 0.9
If it is too large for this, Spark won't help. A Spark DataFrame distributes data by row (although it uses columnar storage locally), so the size of an individual row is limited to local memory.
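If you'd rather stay in Spark and the number of distinct values in s is small enough for them to become columns, the transpose can also be expressed through the "pivot with aggregation" route mentioned above: melt the frame into (s, column_name, value) triples, then pivot on s. Below is a minimal sketch, assuming df is the example frame from the answer and that the values in s are unique; names such as long_df and column_name are illustrative, not part of the original answer:
from pyspark.sql import functions as F

value_cols = [c for c in df.columns if c != "s"]

# Melt: one (s, column_name, value) row per original cell
stack_expr = "stack({n}, {pairs}) as (column_name, value)".format(
    n=len(value_cols),
    pairs=", ".join("'{0}', `{0}`".format(c) for c in value_cols))
long_df = df.selectExpr("s", stack_expr)

# Pivot: distinct values of s become columns; first() is a safe
# aggregate because each (column_name, s) pair occurs exactly once
transposed = long_df.groupBy("column_name").pivot("s").agg(F.first("value"))
transposed.show()
Note that this still materializes one output row per original column, with one field per original row, so it is subject to the same row-width limit described above, as well as to Spark's configured cap on the number of pivot values.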