How to use QuantileDiscretizer across groups in a DataFrame?


Question

I have a DataFrame with the following columns.

scala> show_times.printSchema
root
 |-- account: string (nullable = true)
 |-- channel: string (nullable = true)
 |-- show_name: string (nullable = true)
 |-- total_time_watched: integer (nullable = true)

This is data about how many times a customer has watched a particular show. I'm supposed to categorize the customers for each show based on total time watched.

The dataset has 133 million rows in total with 192 distinct show_names.

For each individual show, I'm supposed to bin the customers into 3 categories (1, 2, 3).

I use Spark MLlib's QuantileDiscretizer.

Currently I loop through every show and run QuantileDiscretizer in a sequential manner, as sketched below.
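
Roughly, the sequential approach looks like this (a sketch assuming a PySpark DataFrame named show_times matching the schema above):

from pyspark.ml.feature import QuantileDiscretizer

# Fit a separate QuantileDiscretizer per show and union the results.
shows = [r.show_name for r in show_times.select('show_name').distinct().collect()]

binned = None
for show in shows:
    subset = show_times.where(show_times.show_name == show)
    discretizer = QuantileDiscretizer(numBuckets=3,
                                      inputCol='total_time_watched',
                                      outputCol='Time_watched_bin')
    # fit() computes the quantile splits for this show; transform() applies them.
    result = discretizer.fit(subset).transform(subset)
    binned = result if binned is None else binned.union(result)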

What I'd like to have in the end is for the following sample input to produce the sample output below.

Sample input:

account,channel,show_name,total_time_watched
acct1,ESPN,show1,200
acct2,ESPN,show1,250
acct3,ESPN,show1,800
acct4,ESPN,show1,850
acct5,ESPN,show1,1300
acct6,ESPN,show1,1320
acct1,ESPN,show2,200
acct2,ESPN,show2,250
acct3,ESPN,show2,800
acct4,ESPN,show2,850
acct5,ESPN,show2,1300
acct6,ESPN,show2,1320

Sample output:

account,channel,show_name,total_time_watched,Time_watched_bin
acct1,ESPN,show1,200,1
acct2,ESPN,show1,250,1
acct3,ESPN,show1,800,2
acct4,ESPN,show1,850,2
acct5,ESPN,show1,1300,3
acct6,ESPN,show1,1320,3
acct1,ESPN,show2,200,1
acct2,ESPN,show2,250,1
acct3,ESPN,show2,800,2
acct4,ESPN,show2,850,2
acct5,ESPN,show2,1300,3
acct6,ESPN,show2,1320,3

Is there a more efficient and distributed way to do this using some groupBy-like operation, instead of looping through each show_name and binning them one after the other?

Answer

This is an older question, but answering it may help someone in the same situation in the future.

It can be achieved using a pandas UDF. Both the input and the output of a grouped-map pandas UDF are pandas DataFrames. We need to provide the schema of the output DataFrame in the decorator, as shown in the code sample below, which achieves the required result.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType

# The output schema is the input schema plus the new bin column.
output_schema = StructType(df.schema.fields + [StructField('Time_watched_bin', IntegerType(), True)])

@pandas_udf(output_schema, PandasUDFType.GROUPED_MAP)
def get_buckets(pdf):
    # pdf is a pandas DataFrame containing all rows for one show_name.
    # pd.cut splits total_time_watched into 3 equal-width bins labelled 0, 1, 2.
    pdf['Time_watched_bin'] = pd.cut(pdf['total_time_watched'], 3, labels=False)
    return pdf

df = df.groupby('show_name').apply(get_buckets)

df will have a new column 'Time_watched_bin' with the bucket information.
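
Note that pd.cut produces equal-width bins, while QuantileDiscretizer splits on quantiles. To mirror the quantile behaviour and the 1-3 labels of the sample output, a variant of the UDF body could use pd.qcut (a sketch; the +1 label shift is an assumption to match the sample output):

@pandas_udf(output_schema, PandasUDFType.GROUPED_MAP)
def get_buckets(pdf):
    # pd.qcut bins on quantiles, approximating QuantileDiscretizer's behaviour;
    # labels=False yields codes 0..2, so +1 shifts them to 1..3.
    pdf['Time_watched_bin'] = pd.qcut(pdf['total_time_watched'], 3, labels=False) + 1
    return pdf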

