Remove blank space from data frame column values in Spark

Question

I have a data frame (business_df) with the following schema:

|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- city: string (nullable = true)
|-- full_address: string (nullable = true)
|-- hours: struct (nullable = true)
|-- name: string (nullable = true)

I want to make a new data frame (new_df) so that the values in the 'name' column do not contain any blank spaces.

My code is:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

udf = UserDefinedFunction(lambda x: x.replace(' ', ''), StringType())
new_df = business_df.select(*[udf(column).alias(name) if column == name else column for column in business_df.columns])
new_df.registerTempTable("vegas")
new_df.printSchema()
vegas_business = sqlContext.sql("SELECT stars, name from vegas limit 10").collect()

I keep getting this error:

NameError: global name 'replace' is not defined

What is wrong with this code?

Recommended answer

While the problem you've described is not reproducible with the provided code, using Python UDFs to handle simple tasks like this is rather inefficient. If you simply want to remove spaces from the text, use regexp_replace:

from pyspark.sql.functions import regexp_replace, col

df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

df.select(regexp_replace(col("v"), " ", ""))

If you want to normalize blank values (strings made up only of spaces), use trim, which strips leading and trailing spaces:

from pyspark.sql.functions import trim

df.select(trim(col("v")))

If you want to keep leading / trailing spaces in non-blank values, you can adjust regexp_replace so that it only matches values consisting entirely of whitespace:

# Raw string so that \s reaches the regex engine without Python treating it as an escape.
df.select(regexp_replace(col("v"), r"^\s+$", ""))
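
To compare the three variants side by side without a Spark session, here is a minimal plain-Python sketch that applies the same patterns to the sample values from the answer. The helper names (remove_all_spaces, trim_spaces, blank_if_only_whitespace) are hypothetical and exist only for this illustration. It uses Python's re module rather than Spark's Java regex engine, and str.strip(" ") as an approximation of trim; for these simple patterns and inputs the results should match, but treat that as an assumption rather than a guarantee.

```python
import re

# Sample values mirroring the "v" column of the answer's example data frame.
values = ["foo bar", "foobar ", "   "]


def remove_all_spaces(s):
    # Same pattern as regexp_replace(col("v"), " ", ""): drop every space.
    return re.sub(" ", "", s)


def trim_spaces(s):
    # Approximates trim(col("v")): strip leading and trailing spaces only.
    return s.strip(" ")


def blank_if_only_whitespace(s):
    # Same pattern as regexp_replace(col("v"), r"^\s+$", ""):
    # only values made up entirely of whitespace are emptied.
    return re.sub(r"^\s+$", "", s)


for v in values:
    print(
        repr(v),
        repr(remove_all_spaces(v)),
        repr(trim_spaces(v)),
        repr(blank_if_only_whitespace(v)),
    )
```

Only the first variant removes the inner space of "foo bar", which is what the question asks for; the other two leave non-blank text between words untouched.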
