Why is the partition key column missing from the DataFrame?
Question
I have a job which loads a DataFrame object and then saves the data to parquet format using the DataFrame `partitionBy` method. Then I publish the paths created so a subsequent job can use the output. The paths in the output look like this:
/ptest/_SUCCESS
/ptest/id=0
/ptest/id=0/part-00000-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=0/part-00001-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=0/part-00002-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=1
/ptest/id=1/part-00003-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=1/part-00004-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=1/part-00005-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=3
/ptest/id=3/part-00006-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
/ptest/id=3/part-00007-942fb247-1fe4-4147-a41a-bc688f932862.snappy.parquet
When I receive new data it is appended to the dataset. The paths are published so jobs which depend on the data can just process the new data.
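To see why the column disappears later, note that Hive-style partitioning stores the partition key only in the directory name (`id=0`, `id=1`, ...), not inside the parquet files themselves. The following pure-Python sketch (a hypothetical helper, not part of Spark) shows where the key actually lives:

```python
# Illustrative sketch: with Hive-style partitioning, partitionBy("id") drops
# the "id" column from the parquet files and encodes its value only in the
# directory name, e.g. "/ptest/id=0". This hypothetical helper extracts the
# key=value segments from such a path.
def partition_values(path):
    """Return the key=value partition segments found in a path."""
    values = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values

print(partition_values("/ptest/id=0/part-00000.snappy.parquet"))
# {'id': '0'}
```

So a reader that is handed only `/ptest/id=0/` as its root has no way to know that `id=0` is partition metadata rather than just a directory name.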
Here is a simplified example of the code:
>>> rdd = sc.parallelize([(0,1,"A"), (0,1,"B"), (0,2,"C"), (1,2,"D"), (1,10,"E"), (1,20,"F"), (3,18,"G"), (3,18,"H"), (3,18,"I")])
>>> df = sqlContext.createDataFrame(rdd, ["id", "score","letter"])
>>> df.show()
+---+-----+------+
| id|score|letter|
+---+-----+------+
| 0| 1| A|
| 0| 1| B|
| 0| 2| C|
| 1| 2| D|
| 1| 10| E|
| 1| 20| F|
| 3| 18| G|
| 3| 18| H|
| 3| 18| I|
+---+-----+------+
>>> df.write.partitionBy("id").format("parquet").save("hdfs://localhost:9000/ptest")
The problem is when another job tries to read the file using the published paths:
>>> df2 = spark.read.format("parquet").load("hdfs://localhost:9000/ptest/id=0/")
>>> df2.show()
+-----+------+
|score|letter|
+-----+------+
| 1| A|
| 1| B|
| 2| C|
+-----+------+
As you can see, the partition key is missing from the loaded dataset. If I were to publish a schema that jobs could use, I could load the file with that schema. The file then loads and the partition key column exists, but its values are null:
>>> df2 = spark.read.format("parquet").schema(df.schema).load("hdfs://localhost:9000/ptest/id=0/")
>>> df2.show()
+----+-----+------+
| id|score|letter|
+----+-----+------+
|null| 1| A|
|null| 1| B|
|null| 2| C|
+----+-----+------+
Is there a way to make sure the partition keys are stored within the parquet data? I don't want to require other processes to parse the paths to get the keys.
Accepted answer
In a case like this you should provide the basePath option:
(spark.read
.format("parquet")
.option("basePath", "hdfs://localhost:9000/ptest/")
.load("hdfs://localhost:9000/ptest/id=0/"))
where basePath points to the root directory of your data.
With basePath, DataFrameReader will be aware of the partitioning and adjust the schema accordingly.
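The idea can be approximated in plain Python (this is a sketch of the concept, not Spark's actual partition-discovery code; `recover_partitions` is a hypothetical helper): the reader treats every `key=value` directory between basePath and the loaded path as a partition column and adds it back to the schema.

```python
# Rough approximation of what basePath enables: any key=value directory
# between the base path and the data files is treated as a partition column.
import posixpath  # HDFS paths are POSIX-style, so use posixpath explicitly


def recover_partitions(base_path, load_path):
    """Return partition columns implied by load_path relative to base_path."""
    rel = posixpath.relpath(load_path.rstrip("/"), base_path.rstrip("/"))
    cols = {}
    for segment in rel.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            cols[key] = value
    return cols

print(recover_partitions("hdfs://localhost:9000/ptest/",
                         "hdfs://localhost:9000/ptest/id=0/"))
# {'id': '0'}
```

An alternative that sidesteps the issue entirely is to load the root path and filter, e.g. `spark.read.parquet("hdfs://localhost:9000/ptest").filter("id = 0")`; the `id` column is retained and Spark still prunes to the matching partition directories.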