How to mask columns using Spark 2?
Problem description
I have some tables in which I need to mask some of the columns. The columns to be masked vary from table to table, and I read them from an application.conf file.
For example, take the employee table shown below:
+----+------+-----+---------+
| id | name | age | address |
+----+------+-----+---------+
| 1 | abcd | 21 | India |
+----+------+-----+---------+
| 2 | qazx | 42 | Germany |
+----+------+-----+---------+
If we want to mask the name and age columns, I get these column names as a sequence:
val mask = Seq("name", "age")
The expected values after masking are:
+----+----------------+----------------+---------+
| id | name | age | address |
+----+----------------+----------------+---------+
| 1 | *** Masked *** | *** Masked *** | India |
+----+----------------+----------------+---------+
| 2 | *** Masked *** | *** Masked *** | Germany |
+----+----------------+----------------+---------+
If I have the employee table as a DataFrame, how can I mask these columns?
Similarly, if I have the payment table shown below and want to mask the name and salary columns, I get the mask columns in a sequence as:
+----+------+--------+----------+
| id | name | salary | tax_code |
+----+------+--------+----------+
| 1 | abcd | 12345 | KT10 |
+----+------+--------+----------+
| 2 | qazx | 98765 | AD12d |
+----+------+--------+----------+
val mask = Seq("name", "salary")
I tried something like this:
mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***")))
but it did not return anything.
Thanks to @philantrovert, I found a solution. Here is the solution I used:
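import org.apache.spark.sql.DataFrame
def maskData(base: DataFrame, maskColumns: Seq[String]): DataFrame = {
  // One SQL expression per column: a constant for masked columns, the column name otherwise
  val maskExpr = base.columns.map { col =>
    if (maskColumns.contains(col)) s"'*** Masked ***' as ${col}" else col
  }
  base.selectExpr(maskExpr: _*)
}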
Recommended answer
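Your statement
mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***")))
returns Unit. foreach discards the new DataFrame that each withColumn call produces, and because DataFrames are immutable, base itself is never changed. Using map instead would return a Seq[org.apache.spark.sql.DataFrame] with one DataFrame per masked column, which doesn't sound too good either.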
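If you would rather keep the withColumn approach, one possible sketch is to thread the DataFrame through foldLeft, so that the result of each withColumn call is passed on to the next one instead of being thrown away. The names FoldLeftMask and maskWithFold are hypothetical, and the sketch assumes a constant replacement via lit is acceptable (unlike regexp_replace, lit also overwrites null values).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Hypothetical helper object, for illustration only.
object FoldLeftMask {
  // Start from `base` and apply one withColumn per masked column,
  // carrying the DataFrame returned by each step into the next step.
  def maskWithFold(base: DataFrame, maskColumns: Seq[String]): DataFrame = {
    // Skip names that are not in the table; withColumn would otherwise
    // append them as brand-new columns.
    val present = maskColumns.filter(c => base.columns.contains(c))
    present.foldLeft(base) { (df, c) =>
      df.withColumn(c, lit("*** Masked ***"))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("mask-with-fold")
      .getOrCreate()
    import spark.implicits._

    // Same sample rows as the employee table in the question
    val base = Seq((1, "abcd", 21, "India"), (2, "qazx", 42, "Germany"))
      .toDF("id", "name", "age", "address")

    maskWithFold(base, Seq("name", "age")).show(false)
    spark.stop()
  }
}
```

Because withColumn replaces an existing column in place, the column order of the table is preserved. Each call adds a projection to the query plan, so for a long list of columns a single select may be preferable.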
You can use selectExpr and generate your regexp_replace expressions like this:
base.show
+---+----+-----+-------+
| id|name| age|address|
+---+----+-----+-------+
| 1|abcd|12345| KT10 |
| 2|qazx|98765| AD12d|
+---+----+-----+-------+
val mask = Seq("name", "age")
// Build one SQL expression per column of base; masked columns are wrapped in regexp_replace
val expr = base.columns.map { col =>
  if (mask.contains(col)) s"""regexp_replace(${col}, "^.*", "** Masked **" ) as ${col}"""
  else col
}
This generates the following expressions for the sequence mask:
Array[String] = Array(id, regexp_replace(name, "^.*", "** Masked **" ) as name, regexp_replace(age, "^.*", "** Masked **" ) as age, address)
Now you can use selectExpr on the generated sequence:
base.selectExpr(expr: _*).show
+---+------------+------------+-------+
| id| name| age|address|
+---+------------+------------+-------+
| 1|** Masked **|** Masked **| KT10 |
| 2|** Masked **|** Masked **| AD12d|
+---+------------+------------+-------+
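The same single projection can also be written with Column objects instead of SQL strings, which avoids assembling expression text by hand. This is only a sketch under the assumption that a constant replacement is enough; the names TypedMask and maskColumnsTyped are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical helper object, for illustration only.
object TypedMask {
  // Build one Column per existing column: a string literal aliased to the
  // original name for masked columns, the untouched column otherwise.
  def maskColumnsTyped(base: DataFrame, maskColumns: Set[String]): DataFrame = {
    val projected: Array[Column] = base.columns.map { name =>
      if (maskColumns.contains(name)) lit("** Masked **").as(name)
      else col(name)
    }
    base.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("mask-typed")
      .getOrCreate()
    import spark.implicits._

    // Same sample rows as the payment table in the question
    val base = Seq((1, "abcd", 12345, "KT10"), (2, "qazx", 98765, "AD12d"))
      .toDF("id", "name", "salary", "tax_code")

    maskColumnsTyped(base, Set("name", "salary")).show(false)
    spark.stop()
  }
}
```

As with the regexp_replace version, masked columns come back as strings regardless of their original type, so a numeric column such as salary changes type in the result.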