In Spark, how to do One Hot Encoding for top N frequent values only?


Question

Say, in my dataframe df, I have a column my_category containing various values, and I can view the value counts using:

df.groupBy("my_category").count().show()

value   count
a    197
b    166
c    210
d      5
e      2
f      9
g      3

Now, I'd like to apply One Hot Encoding (OHE) on this column, but only for the top N most frequent values (say N = 3), and put all the remaining infrequent values in a dummy column (say, "default"). E.g., the output should be something like:

a  b  c  default
0  0  1  0
1  0  0  0
0  1  0  0
1  0  0  0
...
0  0  0  1
0  0  0  1
...

How can I do this in Spark/Scala?

Note: I know how to do this in Python, e.g., by first building a frequency-based dictionary of the unique values, and then creating the OHE vector by examining the values one by one, putting the infrequent ones in a "default" column.
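
For reference, a minimal Scala sketch of that dictionary-style idea might look like the following (the helper name dictionaryStyleOhe is made up for illustration; the recommended answer below takes a join-based route instead):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}

// Illustrative only: collect the top-n values to the driver (the "dictionary"),
// then build one 0/1 indicator column per value plus a "default" column.
def dictionaryStyleOhe(df: DataFrame, colName: String, n: Int): DataFrame = {
  // The "dictionary": the n most frequent values of the column.
  val topN = df.groupBy(colName).count()
    .orderBy(col("count").desc)
    .limit(n)
    .collect()
    .map(_.getString(0))
    .toSeq

  // One indicator column per frequent value, checked one by one.
  val withIndicators = topN.foldLeft(df) { (acc, v) =>
    acc.withColumn(v, when(col(colName) === v, 1).otherwise(0))
  }

  // Anything not in the top n falls into "default".
  withIndicators
    .withColumn("default", when(col(colName).isin(topN: _*), 0).otherwise(1))
    .drop(colName)
}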

Answer

A custom function can be written to apply One Hot Encoding (OHE) on a particular column, for the top N most frequent values only (say N = 3).

The approach is quite similar to the Python one: 1) build a DataFrame/dictionary of the top n most frequent values; 2) pivot that top-n DataFrame, i.e. create the OHE vector for each frequent value; 3) left-join the given DataFrame with the pivoted DataFrame and replace null with 0, which yields the default OHE vector for infrequent values.
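
For intuition, with N = 3 the pivoted lookup table built in step 2 looks roughly like this (a sketch; Spark sorts the pivot columns alphabetically, but row order may vary):

value  a     b     c     default
a      1     null  null  1
b      null  1     null  1
c      null  null  1     1

After the left join, a row whose value is not in the top n matches nothing, so all of its indicator columns, including default, come back null; filling with 0 and then flipping default produces exactly the (0, 0, 0, 1) default vector.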

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit, when}

import spark.implicits._

// Turns 1 into 0 and anything else into 1; used to invert the "default" marker.
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))

def oheEncoding(df: DataFrame, colName: String, n: Int): DataFrame = {
  // Step 1: find the top-n most frequent values of the column.
  df.createOrReplaceTempView("data")
  val topNDF = spark.sql(
    s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")

  // Step 2: pivot the top-n values into one indicator column per value;
  // every row of this lookup table is a frequent value, so mark default = 1.
  val pivotTopNDF = topNDF
    .groupBy(colName)
    .pivot(colName)
    .count()
    .withColumn("default", lit(1))

  // Step 3: left-join the original data against the lookup table; rows with
  // infrequent values match nothing and come back with nulls everywhere.
  val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)

  // Replace nulls with 0 and flip "default", so infrequent rows end up with
  // default = 1 and frequent rows with default = 0.
  joinedTopNDF
    .na.fill(0, joinedTopNDF.columns)
    .withColumn("default", flip(col("default")))
}

val df = spark
  .sparkContext
  .parallelize(Seq("a", "b", "c", "a", "b", "c", "d", "e", "a", "b", "f", "a", "g", "a", "b", "c", "a", "d", "e", "f", "a", "b", "g", "b", "c", "f", "a", "b", "c"))
  .toDF("value")

val oheEncodedDF = oheEncoding(df, "value", 3)
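
On the sample data, the result can be sanity-checked with show(); the sketch below assumes the column order produced by the pivot (a, b, c, then default), and row order is not guaranteed:

oheEncodedDF.show()

a  b  c  default
1  0  0  0        <- a frequent value, e.g. "a"
0  1  0  0
0  0  1  0
0  0  0  1        <- an infrequent value, e.g. "d"
...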
