How to mask columns using Spark 2?

Views: 52 | Posted: 2020/9/4 3:47:37 | Tags: scala apache-spark apache-spark-sql apache-spark-2.0


Question

I have some tables in which I need to mask some of the columns. The columns to be masked vary from table to table, and I read them from an application.conf file.

For example, take the employee table shown below:

+----+------+-----+---------+
| id | name | age | address |
+----+------+-----+---------+
| 1  | abcd | 21  | India   |
+----+------+-----+---------+
| 2  | qazx | 42  | Germany |
+----+------+-----+---------+

If we want to mask the name and age columns, I get these columns as a sequence:

val mask = Seq("name", "age")

The expected values after masking are:

+----+----------------+----------------+---------+
| id | name           | age            | address |
+----+----------------+----------------+---------+
| 1  | *** Masked *** | *** Masked *** | India   |
+----+----------------+----------------+---------+
| 2  | *** Masked *** | *** Masked *** | Germany |
+----+----------------+----------------+---------+

If I have the employee table as a DataFrame, what is the way to mask these columns?

Similarly, if I have the payment table shown below and want to mask the name and salary columns, I get the mask columns as the sequence that follows the table:

+----+------+--------+----------+
| id | name | salary | tax_code |
+----+------+--------+----------+
| 1  | abcd | 12345  | KT10     |
+----+------+--------+----------+
| 2  | qazx | 98765  | AD12d    |
+----+------+--------+----------+

val mask = Seq("name", "salary")

I tried something like mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***" ) ) ), but it did not return anything.

Thanks to @philantrovert, I found a solution. Here is what I used:

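import org.apache.spark.sql.DataFrame

// Replace every column listed in maskColumns with a constant literal; pass the others through unchanged.
def maskData(base: DataFrame, maskColumns: Seq[String]): DataFrame = {
    val maskExpr = base.columns.map { col => if(maskColumns.contains(col)) s"'*** Masked ***' as ${col}" else col }
    base.selectExpr(maskExpr: _*)
}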
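
One caveat with interpolating column names into SQL strings: a name containing a space or a dot would have to be backtick-quoted before selectExpr can parse it. The sketch below builds the same expressions as plain strings, without Spark, so it can be checked in isolation. MaskExprs, quote and maskExprs are hypothetical names introduced here for illustration; they are not part of the solution above.

```scala
object MaskExprs {
  // Constant that replaces the content of every masked column.
  val MaskLiteral = "*** Masked ***"

  // Backtick-quote a column name; an embedded backtick is escaped by doubling it.
  def quote(name: String): String = "`" + name.replace("`", "``") + "`"

  // One SQL expression per column: masked columns become a literal aliased
  // to the original name, all other columns are selected as they are.
  def maskExprs(columns: Seq[String], mask: Set[String]): Seq[String] =
    columns.map { c =>
      if (mask.contains(c)) s"'$MaskLiteral' as ${quote(c)}" else quote(c)
    }
}
```

With a DataFrame named base, this could be used as base.selectExpr(MaskExprs.maskExprs(base.columns, mask.toSet): _*).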

Recommended answer

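Your statement

mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***" ) ) )

returns Unit: foreach discards the new DataFrame that each withColumn call produces, and base itself is never modified. Using map instead would give a Seq[org.apache.spark.sql.DataFrame] with a single column masked in each one, which doesn't sound too good either.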
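
If you would rather keep the withColumn approach, one option is to thread the DataFrame through the loop with foldLeft, so that each step builds on the result of the previous one. This is a minimal sketch, assuming a local SparkSession and the sample employee rows from the question. The names FoldLeftMask and maskWithFold are made up for illustration, and the mask is written as a plain literal rather than a regexp_replace.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object FoldLeftMask {
  // Start from base and apply withColumn once per masked column,
  // passing the DataFrame returned by each call on to the next one.
  def maskWithFold(base: DataFrame, mask: Seq[String]): DataFrame =
    mask.foldLeft(base) { (df, c) => df.withColumn(c, lit("*** Masked ***")) }

  def main(args: Array[String]): Unit = {
    // Local session for experimenting; a real job would reuse its existing session.
    val spark = SparkSession.builder().master("local[*]").appName("mask-sketch").getOrCreate()
    import spark.implicits._

    val base = Seq((1, "abcd", 21, "India"), (2, "qazx", 42, "Germany"))
      .toDF("id", "name", "age", "address")

    maskWithFold(base, Seq("name", "age")).show(false)
    spark.stop()
  }
}
```

Note that withColumn adds a new column when the given name does not exist yet, so a misspelled entry in mask would show up as an extra column rather than as an error.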

You can build the expressions yourself and pass them to selectExpr:

base.show
+---+----+-----+-------+
| id|name|  age|address|
+---+----+-----+-------+
|  1|abcd|12345|  KT10 |
|  2|qazx|98765|  AD12d|
+---+----+-----+-------+

val mask = Seq("name", "age")
// One SQL expression per column of base: masked columns are replaced, the rest pass through.
val expr = base.columns.map { col =>
   if (mask.contains(col) ) s"""regexp_replace(${col}, "^.*", "** Masked **" ) as ${col}"""
   else col
 }

This generates a regexp_replace expression for each column in the sequence mask:

Array[String] = Array(id, regexp_replace(name, "^.*", "** Masked **" ) as name, regexp_replace(age, "^.*", "** Masked **" ) as age, address)

Now you can use selectExpr on the generated sequence:

base.selectExpr(expr: _*).show

+---+------------+------------+-------+
| id|        name|         age|address|
+---+------------+------------+-------+
|  1|** Masked **|** Masked **|  KT10 |
|  2|** Masked **|** Masked **|  AD12d|
+---+------------+------------+-------+
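
One difference between the regexp_replace approach and a constant literal is worth checking against your own data: regexp_replace returns null when its input is null, so null cells stay null, whereas a literal overwrites every row. Here is a small sketch to compare the two side by side, assuming a local SparkSession. NullMaskCheck, compare, via_regex and via_literal are illustrative names and not part of either solution.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NullMaskCheck {
  // Mask column c twice: once through regexp_replace, once with a constant literal.
  def compare(base: DataFrame, c: String): DataFrame =
    base.selectExpr(
      "id",
      s"""regexp_replace($c, "^.*", "** Masked **") as via_regex""",
      "'** Masked **' as via_literal"
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("null-mask-check").getOrCreate()
    import spark.implicits._

    // Second row has a null name to show how each approach treats missing values.
    val base = Seq[(Int, Option[String])]((1, Some("abcd")), (2, None)).toDF("id", "name")

    compare(base, "name").show(false)
    spark.stop()
  }
}
```

If nulls must be hidden as well, the literal form used in the question's own maskData function may be the simpler choice.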
