Replace string in PySpark
Question
I have a DataFrame with numbers in European format, which I imported as strings: the comma is the decimal separator and the dot is the thousands separator.
from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import FloatType
df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
|  revenue|
+---------+
|-1.269,75|
+---------+
df.printSchema()
root
|-- revenue: string (nullable = true)
Desired output:
df.show()
+--------+
| revenue|
+--------+
|-1269.75|
+--------+
df.printSchema()
root
|-- revenue: float (nullable = true)
I am using the function regexp_replace to first replace the dot with an empty string, then replace the comma with a dot, and finally cast the column to FloatType.
df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', df['revenue'].cast("float"))
But when I attempt the first replacement below, I get an empty string. Why? I was expecting -1269,75.
df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
+-------+
|revenue|
+-------+
|       |
+-------+
Recommended answer
You need to escape . to match it literally, because . is a special character that matches almost any character in regex:
df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
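To see the difference without starting a Spark session, here is a minimal sketch in plain Python. It assumes that Python's re module is a fair stand-in for the Java regex engine behind regexp_replace; the two differ in places, but they treat . and \. the same way. The names raw, no_thousands and value are illustrative only.

```python
import re

raw = "-1.269,75"

# Unescaped "." is a regex wildcard, so every character is matched and removed.
print(repr(re.sub(r".", "", raw)))   # ''

# Escaped "\." matches only a literal dot (the thousands separator).
no_thousands = re.sub(r"\.", "", raw)
print(no_thousands)                  # -1269,75

# Swap the decimal comma for a dot, then convert to a number.
value = float(no_thousands.replace(",", "."))
print(value)                         # -1269.75
```

In the PySpark version, the character class "[.]" should work as an alternative to "\\." if you prefer to avoid backslashes, since a dot inside brackets is matched literally.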