How to change case of whole pyspark dataframe to lower or upper


Problem description

I am trying to apply the pyspark sql functions hash algorithm to every row in two dataframes to identify the differences. The hash algorithm is case-sensitive, i.e. if a column contains 'APPLE' and 'Apple', they are treated as two different values, so I want to change the case of both dataframes to either upper or lower. I was able to achieve this only for the dataframe headers, not for the dataframe values. Please help.

# Code for dataframe column headers
self.df_db1 = self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
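For illustration, a minimal sketch (the fruit column and sample data are hypothetical, not from the question) confirming that pyspark.sql.functions.hash is indeed case-sensitive, which is why the values need normalizing before comparison:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, hash as spark_hash

spark = SparkSession.builder.getOrCreate()

# 'APPLE' and 'Apple' hash to different values, so a row-level diff
# would flag them as different unless the case is normalized first.
df = spark.createDataFrame([("APPLE",), ("Apple",)], ["fruit"])
df.select(col("fruit"), spark_hash(col("fruit")).alias("h")).show()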

Recommended answer

Both answers seem to be OK, with one exception: if you have a numeric column, it will be converted to a string column. To avoid this, try:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Split the schema into string and non-string fields so that numeric
// columns keep their original types.
val stringFields = sourceDF.schema.fields.filter(f => f.dataType == StringType)
val nonStringFields = sourceDF.schema.fields
  .filter(f => f.dataType != StringType)
  .map(f => col(f.name))

// Upper-case only the string columns, preserving each column name.
val stringFieldsTransformed = stringFields.map(f => upper(col(f.name)).as(f.name))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)

Now the types are also correct when you have non-string fields (i.e. numeric fields). If you know that every column is of String type, use one of the other answers - they are correct in that case :)

Python code in PySpark:

from pyspark.sql.functions import col, upper
from pyspark.sql.types import StringType

sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])
fields = sourceDF.schema.fields

# Upper-case the string columns (keeping their names via alias);
# pass numeric columns through unchanged.
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))
stringFieldsTransformed = map(lambda f: upper(col(f.name)).alias(f.name), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
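Applied back to the question, a hedged end-to-end sketch (the df_db1 and df_db2 names come from the question; the sample data, column names, and normalize_case helper are hypothetical) that lower-cases the string columns of both dataframes before hashing, so the row-level comparison ignores case:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, hash as spark_hash
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df_db1 = spark.createDataFrame([(1, "APPLE")], ["id", "fruit"])
df_db2 = spark.createDataFrame([(1, "Apple")], ["id", "fruit"])

def normalize_case(df):
    # Lower-case only the string columns; numeric columns pass through
    # unchanged, so their types are preserved.
    return df.select([
        lower(col(f.name)).alias(f.name)
        if isinstance(f.dataType, StringType) else col(f.name)
        for f in df.schema.fields
    ])

# After normalization the row hashes agree despite the original casing.
h1 = normalize_case(df_db1).select(spark_hash("id", "fruit").alias("h"))
h2 = normalize_case(df_db2).select(spark_hash("id", "fruit").alias("h"))
assert h1.collect() == h2.collect()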
