pyspark type error on reading a pandas dataframe
Question
I read some CSV file into pandas, nicely preprocessed it and set dtypes to desired values of float, int, category. However, when trying to import it into spark I get the following error:
Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
After trying to trace it for a while I found the source of my troubles -> see the CSV file:
"myColumns"
""
"A"
Read into pandas like: small = pd.read_csv(os.path.expanduser('myCsv.csv'))
which then fails to import with:
sparkDF = spark.createDataFrame(small)
Currently I use Spark 2.0.0
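The root cause can be reproduced in pandas alone (a hypothetical sketch, not from the question): read_csv turns the empty quoted cell into NaN, which is a float, so the column mixes float and str values. Spark infers a type per row, gets DoubleType for one row and StringType for the other, and cannot merge them.

```python
from io import StringIO
import pandas as pd

# Hypothetical reproduction of the two-row CSV from the question
csv = '"myColumns"\n""\n"A"\n'
small = pd.read_csv(StringIO(csv))

# The empty cell is parsed as NaN (a float), so the column holds
# mixed float/str values -- exactly what trips Spark's inference.
print(type(small["myColumns"][0]))  # float (NaN)
print(type(small["myColumns"][1]))  # str
```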
Answer
You'll need to define the spark DataFrame schema explicitly and pass it to the createDataFrame function:
from pyspark.sql.types import *
import pandas as pd
small = pd.read_csv("data.csv")
small.head()
# myColumns
# 0 NaN
# 1 A
sch = StructType([StructField("myColumns", StringType(), True)])
df = spark.createDataFrame(small, sch)
df.show()
# +---------+
# |myColumns|
# +---------+
# | NaN|
# | A|
# +---------+
df.printSchema()
# root
# |-- myColumns: string (nullable = true)
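Note that with this schema the missing value still arrives in Spark as the string "NaN" rather than a real null. If proper nulls are wanted, one option (an assumption on my part, not part of the original answer) is to replace pandas NaN with None before handing the frame to Spark:

```python
import pandas as pd

# Sketch: swap NaN for None so Spark sees genuine nulls instead of
# float NaN mixed in with strings. The sample frame mirrors the
# question's two-row CSV.
small = pd.DataFrame({"myColumns": [float("nan"), "A"]})
cleaned = small.where(pd.notnull(small), None)
# spark.createDataFrame(cleaned, sch) would then yield a null, not "NaN"
```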