How do I read a parquet in PySpark written from Spark?


Problem description

I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:

partitionedDF.select("noStopWords","lowerText","prediction").write.save("swift2d://xxxx.keystone/commentClusters.parquet")
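
(Since no format is given, write.save falls back to Spark's default data source, which is parquet in Spark 2.x, so this call does produce a parquet directory at the given path. Writing it as .write.parquet(path) would make that intent explicit.)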

I then go to my Python notebook to read in the data:

df = spark.read.load("swift2d://xxxx.keystone/commentClusters.parquet")

and I get the following error:

AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'

I have looked at the spark documentation and I don't think I should be required to specify a schema. Has anyone run into something like this? Should I be doing something else when I save/load? The data is landing in Object Storage.
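
You are right that parquet normally needs no explicit schema: the reader infers it from the parquet file footers. This particular error usually means Spark found no parquet files at the path to infer from, so an empty directory or a mistyped path in the object store is worth ruling out first. If a schema really must be supplied, a minimal sketch of doing so manually follows; the column names come from the select() above, but the types (token array, raw text, cluster id) are assumptions about what the pipeline produced:

from pyspark.sql.types import (StructType, StructField, ArrayType,
                               StringType, IntegerType)

# Hypothetical schema: field names match the columns written above;
# the types are assumptions and should be adjusted to the real data.
schema = StructType([
    StructField("noStopWords", ArrayType(StringType())),
    StructField("lowerText", StringType()),
    StructField("prediction", IntegerType()),
])

# `spark` is the existing SparkSession from the notebook
df = spark.read.schema(schema).load("swift2d://xxxx.keystone/commentClusters.parquet")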

edit: I'm using Spark 2.0 in both the read and the write.

edit2: This was done in a project in Data Science Experience.

Recommended answer

I read the parquet file in the following way:

from pyspark.sql import SparkSession

# build (or reuse) a SparkSession; this also creates the SparkContext
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()

sc = spark.sparkContext

# wrap the SparkContext in a SQLContext to get a DataFrame reader
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# read the parquet file into a DataFrame
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
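
The SQLContext detour is legacy API, though; in Spark 2.x the SparkSession reads parquet directly, which matches the question's setup. A minimal sketch (the swift2d path is copied from the question, and the app name is arbitrary; substitute your own):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('readParquet').getOrCreate()

# the SparkSession's reader handles parquet natively; no SQLContext needed
df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")
df.printSchema()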

