Do parquet files preserve the row order of Spark DataFrames?
Problem description
When I save a Spark DataFrame as a parquet file then read it back, the rows of the resulting DataFrame are not the same as the original as shown in the session below. Is this a "feature" of DataFrames or of parquet files? What would be the best way to save a DataFrame in a row-order preserving manner?
>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(np.random.random((10,2)))
>>> pdf
0 1
0 0.191519 0.622109
1 0.437728 0.785359
2 0.779976 0.272593
3 0.276464 0.801872
4 0.958139 0.875933
5 0.357817 0.500995
6 0.683463 0.712702
7 0.370251 0.561196
8 0.503083 0.013768
9 0.772827 0.882641
>>> df = sqlContext.createDataFrame(pdf)
>>> df.show()
+-------------------+--------------------+
| 0| 1|
+-------------------+--------------------+
| 0.1915194503788923| 0.6221087710398319|
| 0.4377277390071145| 0.7853585837137692|
| 0.7799758081188035| 0.2725926052826416|
| 0.2764642551430967| 0.8018721775350193|
| 0.9581393536837052| 0.8759326347420947|
|0.35781726995786667| 0.5009951255234587|
| 0.6834629351721363| 0.7127020269829002|
|0.37025075479039493| 0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
| 0.772826621612374| 0.8826411906361166|
+-------------------+--------------------+
>>> df.write.parquet('test.parquet')
>>> df2 = sqlContext.read.parquet('test.parquet')
>>> df2.show()
+-------------------+--------------------+
| 0| 1|
+-------------------+--------------------+
| 0.6834629351721363| 0.7127020269829002|
|0.37025075479039493| 0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
| 0.772826621612374| 0.8826411906361166|
| 0.7799758081188035| 0.2725926052826416|
| 0.2764642551430967| 0.8018721775350193|
| 0.1915194503788923| 0.6221087710398319|
| 0.4377277390071145| 0.7853585837137692|
| 0.9581393536837052| 0.8759326347420947|
|0.35781726995786667| 0.5009951255234587|
+-------------------+--------------------+
This looks like it's the result of partitioning within Spark (together with the implementation of show()). The function show() essentially wraps some pretty formatting around a call to take(), and there is a good explanation of how take() works here. Since the partitions read first may not be the same across the two calls to show(), you will see different values.