将 Pandas 数据帧转换为 Spark 数据帧错误 [英] Converting Pandas dataframe into Spark dataframe error

查看:55
本文介绍了将 Pandas 数据帧转换为 Spark 数据帧错误的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将 Pandas DF 转换为 Spark 一个.DF头:

I'm trying to convert Pandas DF into Spark one. DF head:

10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543
10000001,2,0,1,12:36,OK,10002,1,0,9,f,NA,24,24,0,3,9,2,1,1,3,1,3,2,611
10000002,1,0,4,12:19,PA,10003,1,1,7,f,NA,74,74,0,2,15,2,0,2,3,1,2,2,691

代码:

dataset = pd.read_csv("data/AS/test_v2.csv")
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(dataset)

我遇到了一个错误:

TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

推荐答案

您需要确保您的 Pandas 数据框列适合 spark 推断的类型.如果您的 Pandas 数据框列出如下内容:

You need to make sure your pandas dataframe columns are appropriate for the type spark is inferring. If your pandas dataframe lists something like:

pd.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5062 entries, 0 to 5061
Data columns (total 51 columns):
SomeCol                    5062 non-null object
Col2                       5062 non-null object

然后你得到那个错误试试:

And you're getting that error try:

df[['SomeCol', 'Col2']] = df[['SomeCol', 'Col2']].astype(str)

现在,确保 .astype(str) 实际上是您希望这些列的类型.基本上,当底层 Java 代码尝试从 Python 中的对象推断类型时,它会使用一些观察并进行猜测,如果该猜测不适用于列中的所有数据,则它会尝试从 Pandas 转换为spark它会失败.

Now, make sure .astype(str) is actually the type you want those columns to be. Basically, when the underlying Java code tries to infer the type from an object in python it uses some observations and makes a guess, if that guess doesn't apply to all the data in the column(s) it's trying to convert from pandas to spark it will fail.

这篇关于将 Pandas 数据帧转换为 Spark 数据帧错误的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆