Create Spark DataFrame from Pandas DataFrame
Problem Description
I'm trying to build a Spark DataFrame from a simple Pandas DataFrame. These are the steps I follow.
import pandas as pd
pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.printSchema()
Up to this point everything is OK. The output is:
root
|-- Letters: string (nullable = true)
The problem comes when I try to print the DataFrame:
spark_df.show()
This is the outcome:
An error occurred while calling o158.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker:
Error executing Jupyter command 'pyspark.daemon': [Errno 2] No such file or directory PYTHONPATH was:
/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/jars/spark-core_2.11-2.4.0.jar:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/: org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
These are my Spark specs:
SparkSession - hive
SparkContext
Spark UI
Version: v2.4.0
Master: local[*]
AppName: PySparkShell
This is my venv:
export PYSPARK_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'
Facts:
As the error mentions, it has to do with running pyspark from Jupyter. Running it with 'PYSPARK_PYTHON=python2.7' or 'PYSPARK_PYTHON=python3.6' works fine.
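The error message hints at why: Spark launches its Python workers with the interpreter named in PYSPARK_PYTHON, so setting it to jupyter makes every worker try to start pyspark.daemon through Jupyter ("Error executing Jupyter command 'pyspark.daemon'"). A common fix, sketched below under the assumption that the goal is to run the driver in JupyterLab, is to put Jupyter in PYSPARK_DRIVER_PYTHON and leave PYSPARK_PYTHON pointing at a plain interpreter:

```shell
# Assumed fix: the driver may run inside Jupyter, but the workers
# must be started with a plain Python interpreter, not with Jupyter.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'
export PYSPARK_PYTHON=python3.6
```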
Recommended Answer
Import and initialise findspark, create a Spark session, and then use the session to convert the pandas data frame to a Spark data frame. Then add the new Spark data frame to the catalogue. Tested and runs in both Jupyter 5.7.2 and Spyder 3.3.2 with Python 3.6.6.
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
# Create a spark session
spark = SparkSession.builder.getOrCreate()
# Create pandas data frame and convert it to a spark data frame
pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
spark_df = spark.createDataFrame(pandas_df)
# Add the spark data frame to the catalog
spark_df.createOrReplaceTempView('spark_df')
spark_df.show()
+-------+
|Letters|
+-------+
| X|
| Y|
| Z|
+-------+
spark.catalog.listTables()
Out[18]: [Table(name='spark_df', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]