Converting Pandas dataframe into Spark dataframe error
Question
I'm trying to convert a Pandas DF into a Spark one. DF head:
10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543
10000001,2,0,1,12:36,OK,10002,1,0,9,f,NA,24,24,0,3,9,2,1,1,3,1,3,2,611
10000002,1,0,4,12:19,PA,10003,1,1,7,f,NA,74,74,0,2,15,2,0,2,3,1,2,2,691
Code:
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
dataset = pd.read_csv("data/AS/test_v2.csv")
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(dataset)
And I got an error:
TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>
Answer
You need to make sure your Pandas DataFrame columns are appropriate for the types Spark is inferring. If your Pandas DataFrame lists something like:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5062 entries, 0 to 5061
Data columns (total 51 columns):
SomeCol 5062 non-null object
Col2 5062 non-null object
and you're getting that error, try:
df[['SomeCol', 'Col2']] = df[['SomeCol', 'Col2']].astype(str)
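If you don't want to list the columns by hand, one option is to cast every object-typed column to str in a single pass before handing the frame to Spark. This is a minimal Pandas-only sketch; the column names and values here are made up to mirror the situation above, not taken from the original data:

```python
import pandas as pd

# Toy frame with a mixed-type object column ("NA" string mixed with a number),
# which is the kind of column that trips up Spark's type inference.
df = pd.DataFrame({"SomeCol": ["NA", 24.0], "Col2": ["f", "f"]})

# Cast all object-dtype columns to plain strings in one pass.
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].astype(str)

print(df.dtypes)
```

After this, every formerly object-typed column holds uniform strings, so Spark's inference sees a single consistent type per column.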
Now, make sure .astype(str) is actually the type you want those columns to be. Basically, when the underlying Java code tries to infer the type from a Python object, it uses a few observations and makes a guess. If that guess doesn't apply to all the data in the column(s) it's trying to convert from Pandas to Spark, the conversion will fail.
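If you're not sure which column is triggering the merge error, a quick way to spot the likely culprits is to look for columns whose cells span more than one Python type. This is an illustrative sketch with made-up data in the same shape as the example above:

```python
import pandas as pd

# Illustrative frame: "NA" (a string) mixed with floats in one column.
df = pd.DataFrame({"SomeCol": ["NA", 24.0, 74.0], "Col2": ["f", "f", "f"]})

# Columns whose values span more than one Python type are the ones
# Spark's inference is most likely to choke on.
mixed_cols = [c for c in df.columns if df[c].map(type).nunique() > 1]
print(mixed_cols)
```

Here `mixed_cols` comes back as `["SomeCol"]`; those are the columns to `.astype(str)` (or otherwise normalize) before calling createDataFrame.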