PySpark: show histogram of a data frame column


Question

In a pandas data frame, I am using the following code to plot a histogram of a column:

my_df.hist(column = 'field_1')

Is there something that can achieve the same goal with a PySpark data frame? (I am in a Jupyter Notebook.) Thanks!

Recommended answer

Unfortunately, I don't think there's a clean plot() or hist() function in the PySpark DataFrame API, but I'm hoping that things will eventually go in that direction.

For the time being, you could compute the histogram in Spark and plot the computed histogram as a bar chart. Example:

import pandas as pd
import pyspark.sql as sparksql

# Let's use UCLA's college admission dataset
file_name = "https://stats.idre.ucla.edu/stat/data/binary.csv"

# Creating a pandas dataframe from Sample Data
df_pd = pd.read_csv(file_name)

# `sc` is the SparkContext that a PySpark notebook session normally provides
sql_context = sparksql.SQLContext(sc)

# Creating a Spark DataFrame from a pandas dataframe
df_spark = sql_context.createDataFrame(df_pd)

df_spark.show(5)

This is what the data looks like:

Out[]:    +-----+---+----+----+
          |admit|gre| gpa|rank|
          +-----+---+----+----+
          |    0|380|3.61|   3|
          |    1|660|3.67|   3|
          |    1|800| 4.0|   1|
          |    1|640|3.19|   4|
          |    0|520|2.93|   4|
          +-----+---+----+----+
          only showing top 5 rows


# This is what we want (df_pd is the pandas dataframe created above)
df_pd.hist('gre');

Histogram plotted using df_pd.hist()

# Doing the heavy lifting in Spark. We could leverage the `histogram` function from the RDD api

gre_histogram = df_spark.select('gre').rdd.flatMap(lambda x: x).histogram(11)

# Loading the Computed Histogram into a Pandas Dataframe for plotting
pd.DataFrame(
    list(zip(*gre_histogram)), 
    columns=['bin', 'frequency']
).set_index(
    'bin'
).plot(kind='bar');

Histogram computed using RDD.histogram()
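When given a bucket count, RDD.histogram() returns a tuple of two lists: the bin edges (one more than the number of bins) and the per-bin counts. The zip in the snippet above therefore pairs each count with the left edge of its bin and silently drops the last edge. If more readable bar labels are wanted, something along these lines might help. It is only a sketch: `label_histogram` is a hypothetical helper, not part of PySpark, and the sample tuple is made-up data that merely has the same shape as the histogram output.

```python
def label_histogram(histogram):
    """Pair each count with a readable label for its bin.

    `histogram` is assumed to be a (bin_edges, counts) tuple shaped like
    the value returned by RDD.histogram(n): n + 1 edges and n counts.
    """
    edges, counts = histogram
    if len(edges) != len(counts) + 1:
        raise ValueError("expected one more bin edge than counts")

    rows = []
    last = len(counts) - 1
    for i, count in enumerate(counts):
        # Every bin is half-open except the last one, which also
        # includes its upper edge.
        closing = "]" if i == last else ")"
        label = f"[{edges[i]:g}, {edges[i + 1]:g}{closing}"
        rows.append((label, count))
    return rows


# Made-up values, not taken from the admissions dataset
sample = ([0.0, 10.0, 20.0, 30.0], [2, 5, 1])
rows = label_histogram(sample)
for label, count in rows:
    print(label, count)
```

The resulting rows could be passed to pd.DataFrame(rows, columns=['bin', 'frequency']) in place of list(zip(*gre_histogram)), leaving the rest of the plotting code unchanged.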
