How to build a sparkSession in Spark 2.0 using pyspark?


Question

I just got access to Spark 2.0; up to this point I have been using Spark 1.6.1. Can someone please help me set up a SparkSession using pyspark (Python)? I know the Scala examples available online are similar (here), but I was hoping for a direct walkthrough in Python.

My specific case: I am loading Avro files from S3 in a Zeppelin Spark notebook, then building DataFrames and running various pyspark & SQL queries off of them. All of my old queries use sqlContext. I know this is poor practice, but I started my notebook with

sqlContext = SparkSession.builder.enableHiveSupport().getOrCreate()

I can use

mydata = sqlContext.read.format("com.databricks.spark.avro").load("s3:...

and build DataFrames with no issues. But once I start querying the DataFrames/temp tables, I keep getting a "java.lang.NullPointerException" error. I think that is indicative of a translational error (e.g. old queries worked in 1.6.1 but need to be tweaked for 2.0). The error occurs regardless of query type. So I am assuming

1.) the sqlContext alias is a bad idea

2.) I need to properly set up a sparkSession.

So if someone could show me how this is done, or perhaps explain the discrepancies they know of between the different versions of Spark, I would greatly appreciate it. Please let me know if I need to elaborate on this question. I apologize if it is convoluted.

Answer

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
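If you also need Hive support (the question's notebook used enableHiveSupport) and want to keep the S3/Avro and temp-table workflow described above, a minimal sketch is below; the S3 path and the view name "mydata" are placeholders, and the Avro reader still needs the com.databricks spark-avro package on the classpath:

from pyspark.sql import SparkSession

# Reuse the session a notebook like Zeppelin may have already started; add Hive support.
spark = SparkSession.builder \
    .appName('abc') \
    .enableHiveSupport() \
    .getOrCreate()

# Load the Avro files through the session instead of the old sqlContext.
mydata = spark.read.format("com.databricks.spark.avro").load("s3://bucket/path")

# Register a temp view and query it with spark.sql rather than sqlContext.sql.
mydata.createOrReplaceTempView("mydata")
spark.sql("SELECT COUNT(*) FROM mydata").show()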

Now, to import a .csv file you can use

df = spark.read.csv('filename.csv', header=True)
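A small usage sketch (the file name and options are illustrative; inferSchema is optional and just asks Spark to guess column types):

# 'filename.csv' is a placeholder path.
df = spark.read.csv('filename.csv', header=True, inferSchema=True)
df.printSchema()
df.show(5)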
