Rename columns with special characters in python or Pyspark dataframe
Question
I have a data frame in python/pyspark. The column names contain special characters such as dots (.), spaces, parentheses (()) and braces ({}).
I want to rename the columns so that dots and spaces are replaced with underscores, and any () and {} are removed from the column names.
This is what I have tried so far (after import re):
df1 = df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))
With this I was able to replace the dots and spaces with underscores, but I am unable to do the second part, i.e. if () and {} are present, remove them from the column names.
How do we achieve that?
Recommended answer
If you have a pyspark dataframe, you can use the withColumnRenamed function to rename the columns. Here is how I tried it; have a look and customize it for your changes.
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> l=[('some value1','some value2','some value 3'),('some value4','some value5','some value 6')]
>>> l_schema = StructType([StructField("col1.some valwith(in)and{around}",StringType(),True),StructField("col2.some valwith()and{}",StringType(),True),StructField("col3 some()valwith.and{}",StringType(),True)])
>>> # (old, new) pairs: dots and spaces become underscores; (), {} are removed
>>> reps=(('.','_'),(' ','_'),('(',''),(')',''),('{',''),('}',''))
>>> rdd = sc.parallelize(l)
>>> df = sqlContext.createDataFrame(rdd,l_schema)
>>> df.printSchema()
root
|-- col1.some valwith(in)and{around}: string (nullable = true)
|-- col2.some valwith()and{}: string (nullable = true)
|-- col3 some()valwith.and{}: string (nullable = true)
>>> df.show()
+--------------------------------+------------------------+------------------------+
|col1.some valwith(in)and{around}|col2.some valwith()and{}|col3 some()valwith.and{}|
+--------------------------------+------------------------+------------------------+
|                     some value1|             some value2|            some value 3|
|                     some value4|             some value5|            some value 6|
+--------------------------------+------------------------+------------------------+
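>>> from functools import reduce  # reduce is not a builtin on Python 3
>>> def colrename(x):
...     # apply each (old, new) pair in reps to the column name in turn
...     return reduce(lambda a,kv : a.replace(*kv),reps,x)
>>> for i in df.schema.names:
...     df = df.withColumnRenamed(i,colrename(i))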
>>> df.printSchema()
root
|-- col1_some_valwithinandaround: string (nullable = true)
|-- col2_some_valwithand: string (nullable = true)
|-- col3_somevalwith_and: string (nullable = true)
>>> df.show()
+----------------------------+--------------------+--------------------+
|col1_some_valwithinandaround|col2_some_valwithand|col3_somevalwith_and|
+----------------------------+--------------------+--------------------+
|                 some value1|             some value2|            some value 3|
|                 some value4|             some value5|            some value 6|
+----------------------------+--------------------+--------------------+
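As an alternative closer to the regex attempt in the question, the two rules can probably be combined in a small helper that runs two re.sub passes. This is only a sketch: the name clean_column_name is made up for illustration, and the column names are the ones from the example schema above.

```python
import re

def clean_column_name(name):
    """Hypothetical helper: underscores for dots/whitespace, drop (), {}."""
    # Same pattern as in the question: runs of dots/whitespace -> one underscore
    name = re.sub(r"[.\s]+", "_", name)
    # Second pass: remove parentheses and braces entirely
    return re.sub(r"[(){}]", "", name)

columns = [
    "col1.some valwith(in)and{around}",
    "col2.some valwith()and{}",
    "col3 some()valwith.and{}",
]
cleaned = [clean_column_name(c) for c in columns]
print(cleaned)

# With an existing PySpark DataFrame `df` (assumed here, not created),
# the rename could then be done in one call, as in the question:
# df1 = df.toDF(*[clean_column_name(c) for c in df.columns])
```

One difference from the replace-based version: because of the + in the first pattern, a run such as ". " collapses into a single underscore, whereas the chain of str.replace calls would produce one underscore per character.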