How to change case of whole pyspark dataframe to lower or upper


Problem Description


I am trying to apply the pyspark sql functions hash algorithm to every row in two dataframes to identify the differences. The hash algorithm is case sensitive, i.e. if a column contains 'APPLE' and 'Apple', they are considered two different values, so I want to change the case of both dataframes to either upper or lower. I am able to achieve this only for the dataframe headers, but not for the dataframe values. Please help.

# Code for lower-casing the dataframe column headers
self.df_db1 = self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
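
For context, a minimal sketch of why the case matters (assuming a SparkSession named spark; the sample data is made up): pyspark's hash function returns different values for 'APPLE' and 'Apple':

from pyspark.sql.functions import hash as spark_hash

# Made-up sample: the same word in two different cases
df = spark.createDataFrame([("APPLE",), ("Apple",)], ["fruit"])

# hash() is case sensitive, so the two rows get different hash values
df.select("fruit", spark_hash("fruit").alias("row_hash")).show()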

Recommended Answer


Both answers seem to be OK, with one exception: if you have a numeric column, it will be converted to a string column. To avoid this, try:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Split the schema into string and non-string fields
val fields = sourceDF.schema.fields
val stringFields = fields.filter(f => f.dataType == StringType)
val nonStringFields = fields.filter(f => f.dataType != StringType).map(f => col(f.name))

// Upper-case only the string columns, keeping their original names
val stringFieldsTransformed = stringFields.map(f => upper(col(f.name)).as(f.name))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)


Now the types are also correct when you have non-string (i.e. numeric) fields. If you know that every column is of String type, use one of the other answers - they are correct in that case :)
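
For illustration, a minimal sketch of the pitfall described above (assuming a SparkSession named spark): upper-casing every column indiscriminately makes Spark implicitly cast the numeric column to string:

from pyspark.sql.functions import col, upper

df = spark.createDataFrame([(1, "a")], ["n", "n1"])

# upper() on the numeric column 'n' implicitly casts it to string
df_all_upper = df.select([upper(col(c)).alias(c) for c in df.columns])
df_all_upper.printSchema()  # both 'n' and 'n1' now come out as string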


Python code in PySpark:

from pyspark.sql.functions import col, upper
from pyspark.sql.types import StringType

sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])

# Split the schema into string and non-string fields
fields = sourceDF.schema.fields
stringFields = [f for f in fields if isinstance(f.dataType, StringType)]
nonStringFields = [col(f.name) for f in fields if not isinstance(f.dataType, StringType)]

# Upper-case only the string columns, keeping their original names
stringFieldsTransformed = [upper(col(f.name)).alias(f.name) for f in stringFields]
df = sourceDF.select([*stringFieldsTransformed, *nonStringFields])
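
A quick check with df.printSchema() confirms the point above: 'n' stays numeric and only 'n1' was upper-cased.

To close the loop on the original question, here is a hedged sketch of the row-level comparison once both dataframes have been normalized this way (the df_db1/df_db2 names come from the question; the full-row hash and the subtract-based diff are assumptions, not part of the answer above):

from pyspark.sql.functions import hash as spark_hash

# Assumption: df_db1 and df_db2 have already been case-normalized as shown above
hashed1 = df_db1.withColumn("row_hash", spark_hash(*df_db1.columns))
hashed2 = df_db2.withColumn("row_hash", spark_hash(*df_db2.columns))

# Hash values present in one dataframe but not the other mark the differing rows
diffs = hashed1.select("row_hash").subtract(hashed2.select("row_hash"))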

