Do parquet files preserve the row order of Spark DataFrames?


Problem Description


When I save a Spark DataFrame as a parquet file and then read it back, the rows of the resulting DataFrame are not in the same order as the original, as shown in the session below. Is this a "feature" of DataFrames or of parquet files? What would be the best way to save a DataFrame in a row-order-preserving manner?

>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(np.random.random((10,2)))
>>> pdf
          0         1
0  0.191519  0.622109
1  0.437728  0.785359
2  0.779976  0.272593
3  0.276464  0.801872
4  0.958139  0.875933
5  0.357817  0.500995
6  0.683463  0.712702
7  0.370251  0.561196
8  0.503083  0.013768
9  0.772827  0.882641
>>> df = sqlContext.createDataFrame(pdf)
>>> df.show()
+-------------------+--------------------+
|                  0|                   1|
+-------------------+--------------------+
| 0.1915194503788923|  0.6221087710398319|
| 0.4377277390071145|  0.7853585837137692|
| 0.7799758081188035|  0.2725926052826416|
| 0.2764642551430967|  0.8018721775350193|
| 0.9581393536837052|  0.8759326347420947|
|0.35781726995786667|  0.5009951255234587|
| 0.6834629351721363|  0.7127020269829002|
|0.37025075479039493|  0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
|  0.772826621612374|  0.8826411906361166|
+-------------------+--------------------+
>>> df.write.parquet('test.parquet')
>>> df2 = sqlContext.read.parquet('test.parquet')
>>> df2.show()
+-------------------+--------------------+
|                  0|                   1|
+-------------------+--------------------+
| 0.6834629351721363|  0.7127020269829002|
|0.37025075479039493|  0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
|  0.772826621612374|  0.8826411906361166|
| 0.7799758081188035|  0.2725926052826416|
| 0.2764642551430967|  0.8018721775350193|
| 0.1915194503788923|  0.6221087710398319|
| 0.4377277390071145|  0.7853585837137692|
| 0.9581393536837052|  0.8759326347420947|
|0.35781726995786667|  0.5009951255234587|
+-------------------+--------------------+

Solution

This looks like the result of partitioning within Spark (as well as of the implementation of show()). The function show() essentially wraps some pretty formatting around a call to take(), and there is a good explanation of how take() works here. Since the partitions read first may not be the same across the two calls to show(), you will see the rows in a different order.
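If the order matters, one common workaround is to persist an explicit ordering column alongside the data and sort on it after reading back. The sketch below demonstrates the idea in plain pandas (simulating the reordering that a partitioned Parquet read can introduce); the column name `row_id` is an arbitrary choice for this example, not anything Spark requires:

```python
import numpy as np
import pandas as pd

# Persist an explicit ordering column alongside the data.
pdf = pd.DataFrame(np.random.random((10, 2)))
pdf["row_id"] = np.arange(len(pdf))

# Simulate the row reordering that partitioned Parquet reads can introduce.
shuffled = pdf.sample(frac=1.0, random_state=0).reset_index(drop=True)

# Restore the original order by sorting on the persisted column.
restored = shuffled.sort_values("row_id").reset_index(drop=True)
assert restored.equals(pdf)
```

In PySpark terms, the equivalent would be adding the `row_id` column to the pandas DataFrame before `sqlContext.createDataFrame(pdf)` (as above), writing with `df.write.parquet(...)`, and then calling `.orderBy('row_id')` on the DataFrame returned by `sqlContext.read.parquet(...)`. Note that `orderBy` triggers a sort across partitions, so it has a cost on large data.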

