Replace string in PySpark

Question

I have a DataFrame with numbers in European format, which I imported as strings. The comma is the decimal separator and the dot is the thousands separator:

from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import FloatType
df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
|  revenue|
+---------+
|-1.269,75|
+---------+
df.printSchema()
root
 |-- revenue: string (nullable = true)

Desired output:

df.show()
+--------+
| revenue|
+--------+
|-1269.75|
+--------+
df.printSchema()
root
 |-- revenue: float (nullable = true)

I am using regexp_replace to first replace the dot with an empty string, then replace the comma with a dot, and finally cast the column to FloatType.

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', df['revenue'].cast("float"))

But when I run only the first replacement, shown below, I get an empty string. Why? I was expecting -1269,75.

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
+-------+
|revenue|
+-------+
|       |
+-------+


Recommended answer

You need to escape . to match it literally, because . is a special character in regex that matches almost any character. The unescaped pattern therefore matches every character in the string and replaces each one with nothing:

df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
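
The difference between the two patterns can be reproduced outside Spark. The following is a minimal sketch that uses Python's built-in re module instead of Spark's Java-based regex engine; the two behave the same way for . and \. on this input. The helper name european_to_float is hypothetical and only mirrors the three steps from the question (drop the dots, turn the comma into a dot, convert to a float).

```python
import re

value = "-1.269,75"

# An unescaped "." matches every character, so the whole string is removed.
print(repr(re.sub(".", "", value)))      # ''

# An escaped "\." matches only a literal dot.
print(repr(re.sub(r"\.", "", value)))    # '-1269,75'


def european_to_float(text):
    """Hypothetical helper mirroring the three steps from the question."""
    no_thousands = re.sub(r"\.", "", text)         # remove thousands separators
    decimal_dot = no_thousands.replace(",", ".")   # comma becomes the decimal point
    return float(decimal_dot)


print(european_to_float(value))          # -1269.75
```

In the PySpark call the same pattern is written as "\\." in an ordinary Python string literal, or equivalently as the raw string r"\.". The comma needs no escaping, so the second regexp_replace from the question works unchanged.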
