Spark converting Pandas df to S3


Problem description

Currently I am using Spark along with the Pandas framework. How can I convert a Pandas DataFrame in a convenient way so that it can be written to S3?

I have tried the option below, but I get an error because df is a Pandas DataFrame and has no write option.

df.write()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("123.csv");

Recommended answer

As you are running this in Spark, one approach would be to convert the Pandas DataFrame into a Spark DataFrame and then save it to S3.

The code snippet below creates the pdf Pandas DataFrame and converts it into the df Spark DataFrame.

import numpy as np
import pandas as pd

# Create Pandas DataFrame
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pdf = pd.DataFrame(d)

# Convert Pandas DataFrame to Spark DataFrame
df = spark.createDataFrame(pdf)
df.printSchema()

To validate, we can also print out the schema of the Spark DataFrame, with the output below.

root
 |-- one: double (nullable = true)
 |-- two: double (nullable = true)

Now that it is a Spark DataFrame, you can use the spark-csv package to save the file, as in the example below. (In Spark 2.0 and later, CSV support is built in, so df.write.csv(...) works without the external package.)

# Save Spark DataFrame to S3
df.write.format('com.databricks.spark.csv').options(header='true').save('123.csv')
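If the data fits comfortably in memory, a pandas-only alternative is DataFrame.to_csv, which (with the s3fs package installed) accepts s3:// paths directly, without involving Spark at all. A minimal sketch using the same example data; the bucket name in the commented-out line is hypothetical:

```python
import pandas as pd

# Build the same example DataFrame as above
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pdf = pd.DataFrame(d)

# Serialize to CSV text; 'one' is NaN at index 'd', so that cell is left empty
csv_text = pdf.to_csv()
print(csv_text)

# With s3fs installed, the same call can target S3 directly, e.g.:
# pdf.to_csv('s3://my-bucket/123.csv')   # 'my-bucket' is a hypothetical name
```

Note that, unlike the Spark writer (which produces a directory of part files), to_csv writes a single CSV object.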

