Rename columns with special characters in python or Pyspark dataframe


Problem Description


I have a data frame in python/pyspark. The column names contain special characters such as dots (.), spaces, parentheses (()), and braces ({}).

Now I want to rename the columns so that any dots and spaces are replaced with underscores, and any () and {} are removed from the column names.

I have done this:

df1 = df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))

With this I was able to replace the dots and spaces with underscores, but I was unable to do the second bit, i.e. if () and {} are there, just remove them from the column names.

How do we achieve that?

Solution

If you have a pyspark dataframe, you can try using the withColumnRenamed function to rename the columns. I tried it my way; have a look and customize it for your needs.

>>> from functools import reduce
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> l=[('some value1','some value2','some value 3'),('some value4','some value5','some value 6')]
>>> l_schema = StructType([StructField("col1.some valwith(in)and{around}",StringType(),True),StructField("col2.some valwith()and{}",StringType(),True),StructField("col3 some()valwith.and{}",StringType(),True)])
>>> reps = ('.','_'),(' ','_'),('(',''),(')',''),('{',''),('}','')
>>> rdd = sc.parallelize(l)
>>> df = sqlContext.createDataFrame(rdd,l_schema)
>>> df.printSchema()
root
 |-- col1.some valwith(in)and{around}: string (nullable = true)
 |-- col2.some valwith()and{}: string (nullable = true)
 |-- col3 some()valwith.and{}: string (nullable = true)

>>> df.show()
+--------------------------------+------------------------+------------------------+
|col1.some valwith(in)and{around}|col2.some valwith()and{}|col3 some()valwith.and{}|
+--------------------------------+------------------------+------------------------+
|                     some value1|             some value2|            some value 3|
|                     some value4|             some value5|            some value 6|
+--------------------------------+------------------------+------------------------+

>>> # apply each (old, new) pair in reps to the column name in turn
>>> def colrename(x):
...    return reduce(lambda a, kv: a.replace(*kv), reps, x)
>>> for i in df.schema.names:
...    df = df.withColumnRenamed(i,colrename(i))
>>> df.printSchema()
root
 |-- col1_some_valwithinandaround: string (nullable = true)
 |-- col2_some_valwithand: string (nullable = true)
 |-- col3_somevalwith_and: string (nullable = true)

>>> df.show()
+----------------------------+--------------------+--------------------+
|col1_some_valwithinandaround|col2_some_valwithand|col3_somevalwith_and|
+----------------------------+--------------------+--------------------+
|                 some value1|         some value2|        some value 3|
|                 some value4|         some value5|        some value 6|
+----------------------------+--------------------+--------------------+
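
Alternatively, if you want to stay with the toDF approach from the question, a second re.sub pass can strip the brackets and braces after the dots and spaces have been replaced. This is only a minimal sketch of that idea, assuming the character class below covers exactly the characters you want removed:

import re

# first replace runs of dots/whitespace with a single underscore,
# then strip any (, ), { or } characters from the result
df1 = df.toDF(*(re.sub(r'[(){}]', '', re.sub(r'[.\s]+', '_', c)) for c in df.columns))

For the example columns above this should produce the same cleaned-up names as the withColumnRenamed loop.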
