In Spark, how to do One Hot Encoding for top N frequent values only?


Problem description


In my dataframe df, I have a column my_category containing different values, and I can view the value counts using:

df.groupBy("my_category").count().show()

my_category  count
a              197
b              166
c              210
d                5
e                2
f                9
g                3


Now, I'd like to apply One Hot Encoding (OHE) on this column, but only for the top N frequent values (say N = 3), and map all the remaining infrequent values to a dummy column (say, "default"). E.g., the output should look like:

a  b  c  default
0  0  1  0
1  0  0  0
0  1  0  0
1  0  0  0
...
0  0  0  1
0  0  0  1
...

How can I do this in Spark / Scala?


Note: I know how to do this in Python, e.g., by first building a frequency-based dictionary of the unique values, and then creating the OHE vector by examining the values one by one, putting the infrequent ones in a "default" column.
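(For reference, here is a minimal sketch of that same dictionary-style idea in Scala, building one 0/1 column per frequent value instead of a vector; the column name my_category, N = 3, and the df from above are assumed for illustration:)

import spark.implicits._
import org.apache.spark.sql.functions.{col, desc, when}

// Collect the top-N values to the driver (the "dictionary").
val topN = df.groupBy("my_category").count()
  .orderBy(desc("count")).limit(3)
  .select("my_category").as[String].collect()

// One indicator column per frequent value, plus a catch-all "default".
val withOhe = topN.foldLeft(df) { (acc, v) =>
  acc.withColumn(v, when(col("my_category") === v, 1).otherwise(0))
}.withColumn("default", when(col("my_category").isin(topN: _*), 0).otherwise(1))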

Answer


A custom function can be written to apply One Hot Encoding (OHE) on a particular column, for the top N frequent values only (say N = 3).


It is relatively similar to the Python approach: 1) build a top-n frequency-based DataFrame/dictionary; 2) pivot the top-n frequent DataFrame, i.e. create the OHE vectors; 3) left-join the given DataFrame with the pivoted DataFrame and replace nulls with 0, i.e. the default OHE vector.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.Column

import spark.implicits._

// Turns a 1 into a 0 and anything else into a 1; used to derive the
// "default" indicator from the pivoted top-N columns after the join.
def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))

def oheEncoding(df: DataFrame, colName: String, n: Int): DataFrame = {
  df.createOrReplaceTempView("data")

  // 1) Top-n most frequent values of the column.
  val topNDF = spark.sql(
    s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")

  // 2) Pivot: one row per top-n value, with a 1 in its own column and
  //    null elsewhere. "default" is set to 1 here and flipped to 0 below.
  val pivotTopNDF = topNDF
    .groupBy(colName)
    .pivot(colName)
    .count()
    .withColumn("default", lit(1))

  // 3) Left join back onto the original data; values outside the top n
  //    get all nulls, which become the all-zero vector with default = 1.
  val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)

  joinedTopNDF
    .na.fill(0, joinedTopNDF.columns)
    .withColumn("default", flip(col("default")))
}

val df = spark
  .sparkContext
  .parallelize(Seq("a", "b", "c", "a", "b", "c", "d", "e", "a", "b", "f", "a", "g", "a", "b", "c", "a", "d", "e", "f", "a", "b", "g", "b", "c", "f", "a", "b", "c"))
  .toDF("value")

val oheEncodedDF = oheEncoding(df, "value", 3)
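One design note: the pivoted lookup table has at most n rows, so the join is cheap. If desired, it can be marked for broadcast explicitly (a sketch only; Spark typically broadcasts such small tables on its own):

import org.apache.spark.sql.functions.broadcast

// Inside oheEncoding, the join line could hint the tiny pivot table:
val joinedTopNDF = df.join(broadcast(pivotTopNDF), Seq(colName), "left").drop(colName)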
