Create Spark DataFrame from Pandas DataFrame


Problem description

I'm trying to build a Spark DataFrame from a simple Pandas DataFrame. These are the steps I follow.

import pandas as pd
pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.printSchema()

Up to this point everything is OK. The output is:


root
 |-- Letters: string (nullable = true)

The problem comes when I try to print the DataFrame:

spark_df.show()

This is the result:


An error occurred while calling o158.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker:
Error executing Jupyter command 'pyspark.daemon': [Errno 2] No such file or directory PYTHONPATH was:
/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/jars/spark-core_2.11-2.4.0.jar:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/: org.apache.spark.SparkException: No port number in pyspark.daemon's stdout

These are my Spark specs:

SparkSession - hive
SparkContext
Spark UI
Version: v2.4.0
Master: local[*]
AppName: PySparkShell

This is my venv:

export PYSPARK_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'

Facts:

As the error mentions, it has to do with running pyspark from Jupyter. Running it with 'PYSPARK_PYTHON=python2.7' and 'PYSPARK_PYTHON=python3.6' works fine.
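
In other words, with PYSPARK_PYTHON=jupyter the worker processes try to launch 'jupyter' as their Python interpreter, which is why pyspark.daemon never starts. A minimal sketch of an environment that keeps the JupyterLab driver but gives the workers a real interpreter (the interpreter name below is an assumption; adjust it to your installation):

# Assumed setup, not taken from the original post: use Jupyter only for the driver.
export PYSPARK_DRIVER_PYTHON=jupyter        # the driver launches inside Jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'     # open JupyterLab
export PYSPARK_PYTHON=python3.6             # workers need an actual Python binary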

Recommended answer

Import and initialise findspark, create a Spark session, and then use it to convert the pandas data frame to a Spark data frame. Then add the new Spark data frame to the catalogue. Tested and runs in both Jupyter 5.7.2 and Spyder 3.3.2 with Python 3.6.6.

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

# Create a spark session
spark = SparkSession.builder.getOrCreate()

# Create pandas data frame and convert it to a spark data frame 
pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
spark_df = spark.createDataFrame(pandas_df)

# Add the spark data frame to the catalog
spark_df.createOrReplaceTempView('spark_df')

spark_df.show()
+-------+
|Letters|
+-------+
|      X|
|      Y|
|      Z|
+-------+

spark.catalog.listTables()
Out[18]: [Table(name='spark_df', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
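
Because the DataFrame was registered as a temporary view in the catalogue, the same session can also query it with SQL. A small illustrative follow-up, not part of the original answer:

# Query the temp view registered above; it should print the rows Y and Z
spark.sql("SELECT Letters FROM spark_df WHERE Letters <> 'X'").show()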
