Pandas dataframe in pyspark to hive


Problem description

How do I send a pandas dataframe to a Hive table?

I know that if I have a Spark dataframe, I can register it as a temporary table using

df.registerTempTable("table_name")
sqlContext.sql("create table table_name2 as select * from table_name")

but when I try to call registerTempTable on the pandas dataframe, I get the error below:

AttributeError: 'DataFrame' object has no attribute 'registerTempTable'

Is there a way for me to register a temp table from a pandas dataframe, or to convert it to a Spark dataframe and then register a temp table from that, so that I can send it back to Hive?

Recommended answer

I guess you are trying to use a pandas DataFrame instead of a Spark DataFrame.

A pandas DataFrame has no such method as registerTempTable.

You may try to create a Spark DataFrame from the pandas DataFrame.

Update:

I've tested it under Cloudera (with the Anaconda parcel installed, which includes the pandas module).

Make sure that you have set PYSPARK_PYTHON to your Anaconda Python installation (or another one that contains the pandas module) on all your Spark workers (usually in spark-conf/spark-env.sh).
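If you want to double-check that the executors actually pick up an interpreter that ships pandas, a small job like the sketch below can help. This is only a hedged example, not part of the original answer: it assumes an existing SparkContext named sc and simply imports pandas inside a task on the workers and collects the version string they see.

>>> # Sanity check (assumes an existing SparkContext `sc`): import pandas on the
>>> # executors and collect the version string visible to them.
>>> sc.parallelize(range(2), 2).map(lambda _: __import__("pandas").__version__).distinct().collect()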

Here is the result of my test:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
>>> sdf = sqlContext.createDataFrame(df)
>>> sdf.show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
| 98| 33| 75|
| 91| 57| 80|
| 20| 87| 85|
| 20| 61| 37|
| 96| 64| 60|
| 79| 45| 82|
| 82| 16| 22|
| 77| 34| 65|
| 74| 18| 17|
| 71| 57| 60|
+---+---+---+

>>> sdf.printSchema()
root
 |-- A: long (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)
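From here, the remaining step the question asks about (getting the data into Hive) works the same as for any Spark DataFrame. A minimal sketch, assuming sqlContext is a HiveContext connected to your Hive metastore; the table names are just examples:

>>> # Assuming `sqlContext` is a HiveContext wired to the Hive metastore;
>>> # table names below are placeholders.
>>> sdf.registerTempTable("pandas_data")
>>> sqlContext.sql("create table pandas_data_hive as select * from pandas_data")
>>> # Alternatively (Spark 1.4+), the DataFrameWriter API can persist it directly:
>>> # sdf.write.saveAsTable("pandas_data_hive")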
