How can I write a parquet file using Spark (pyspark)?
Question
I'm pretty new in Spark and I've been trying to convert a Dataframe to a parquet file in Spark but I haven't had success yet. The documentation says that I can use write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'
from pyspark import SparkContext
sc = SparkContext("local", "Protob Conversion to Parquet ")
# spark is an existing SparkSession
df = sc.textFile("/temp/proto_temp.csv")
# Displays the content of the DataFrame to stdout
df.write.parquet("/output/proto.parquet")
Do you know how to do this?
The spark version that I'm using is Spark 2.0.1 built for Hadoop 2.7.3.
Answer
The error was due to the fact that the textFile method from SparkContext returns an RDD, and what I needed was a DataFrame.
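(As an aside, not part of the original answer: the RDD produced by textFile can also be turned into a DataFrame by parsing each line and calling toDF, once a SparkSession exists. This is only a hypothetical sketch; the column names are placeholders I made up, and reading the CSV through the DataFrame API, as shown below, is simpler.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Protob Conversion to Parquet").getOrCreate()

# Split each CSV line into a list of fields, then convert the RDD to a DataFrame.
# "col1" and "col2" are placeholder column names, not from the original post.
rdd = spark.sparkContext.textFile("/temp/proto_temp.csv") \
          .map(lambda line: line.split(","))
df = rdd.toDF(["col1", "col2"])
df.write.parquet("/output/proto.parquet")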
SparkSession has a SQLContext under the hood, so I needed to use the DataFrameReader to read the CSV file correctly before converting it to a parquet file.
from pyspark.sql import SparkSession

spark = SparkSession \
.builder \
.appName("Protob Conversion to Parquet") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
# read csv
df = spark.read.csv("/temp/proto_temp.csv")
# Displays the content of the DataFrame to stdout
df.show()
df.write.parquet("output/proto.parquet")
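As a quick check (a small addition, not part of the original answer), the written parquet data can be read back with the same DataFrameReader API and displayed:

# Read the parquet output back and display it to verify the write
df_parquet = spark.read.parquet("output/proto.parquet")
df_parquet.show()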