Create Spark DataFrame from Pandas DataFrame


Problem Description

I'm trying to build a Spark DataFrame from a simple Pandas DataFrame. These are the steps I follow.

import pandas as pd
pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.printSchema()

Up to this point everything is OK. The output is:

root
 |-- Letters: string (nullable = true)

The problem comes when I try to print the DataFrame:

spark_df.show()

This is the result:


An error occurred while calling o158.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker:
Error executing Jupyter command 'pyspark.daemon': [Errno 2] No such file or directory PYTHONPATH was:
/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/jars/spark-core_2.11-2.4.0.jar:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/: org.apache.spark.SparkException: No port number in pyspark.daemon's stdout

These are my Spark specifications:

SparkSession - hive

SparkContext

Spark UI

Version: v2.4.0

Master: local[*]

AppName: PySparkShell

This is my venv:

export PYSPARK_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'

Facts:

As the error mentions, it has to do with running pyspark from Jupyter. Running it with 'PYSPARK_PYTHON=python2.7' or 'PYSPARK_PYTHON=python3.6' works fine.
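
A plausible fix (my own sketch, not stated in the original post) is to keep Jupyter only as the driver front end and point PYSPARK_PYTHON at a real interpreter; the exact interpreter name below is an assumption, adjust it to your environment:

export PYSPARK_PYTHON=python3.6          # interpreter for the Spark workers; 'python3.6' is an assumed path
export PYSPARK_DRIVER_PYTHON=jupyter     # Jupyter only launches the driver shell
export PYSPARK_DRIVER_PYTHON_OPTS='lab'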

Recommended Answer

Import and initialise findspark, create a Spark session, and then use that object to convert the pandas data frame to a Spark data frame. Then add the new Spark data frame to the catalog. Tested and runs in both Jupyter 5.7.2 and Spyder 3.3.2 with Python 3.6.6.

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

# Create a spark session
spark = SparkSession.builder.getOrCreate()

# Create pandas data frame and convert it to a spark data frame 
pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
spark_df = spark.createDataFrame(pandas_df)

# Add the spark data frame to the catalog
spark_df.createOrReplaceTempView('spark_df')

spark_df.show()
+-------+
|Letters|
+-------+
|      X|
|      Y|
|      Z|
+-------+

spark.catalog.listTables()
Out[18]: [Table(name='spark_df', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
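
As a small follow-up sketch (not part of the original answer): once the temp view is in the catalog, the same session can also query it with plain SQL, for example:

# Query the registered temp view with SQL; the filter value is just for illustration
spark.sql("SELECT * FROM spark_df WHERE Letters = 'X'").show()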
