Pyspark alter column with substring

Question

Pyspark n00b... How do I replace a column with a substring of itself? I'm trying to remove a set number of characters from the start and end of a string.

from pyspark.sql.functions import substring
import pandas as pd
pdf = pd.DataFrame({'COLUMN_NAME':['_string_','_another string_']})
# this is what i'm looking for...
pdf['COLUMN_NAME_fix']=pdf['COLUMN_NAME'].str[1:-1] 

df = sqlContext.createDataFrame(pdf)
# following not working... COLUMN_NAME_fix is blank
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1)).show() 

This is pretty close but slightly different: "Spark Dataframe column with last character of other column". And then there is this: "LEFT and RIGHT function in PySpark SQL".

Recommended answer

pyspark.sql.functions.substring(str, pos, len)

Substring starts at pos and is of length len when str is String type, or returns the slice of the byte array that starts at pos (in bytes) and is of length len when str is Binary type.

In your code,

df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1))
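1 is pos and -1 becomes len. Unlike a Python slice, the third argument is a length, not an end index, and a length of -1 cannot select any characters, so the new column comes back empty (the blank COLUMN_NAME_fix you are seeing) instead of the trimmed string.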
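To make the difference between a Python slice and a (pos, len) substring concrete, here is a small plain-Python sketch. The helper `sql_substring` is a hypothetical name, and it is only a simplified model of the documented behaviour for pos >= 1, not Spark's actual implementation. Treating a non-positive length as "empty result" is an assumption that matches the blank column reported in the question.

```python
def sql_substring(s, pos, length):
    """Simplified model of substring(str, pos, len) for pos >= 1.

    pos is 1-based and length counts characters to keep, so this is
    not the same as a Python slice s[start:end].
    """
    if pos < 1:
        raise ValueError("this sketch only models pos >= 1")
    if length <= 0:
        # Assumption: nothing can be selected with a non-positive length.
        return ""
    start = pos - 1  # convert 1-based position to 0-based index
    return s[start:start + length]


value = "_string_"

# What the question's call asks for: start at 1, keep -1 characters.
broken = sql_substring(value, 1, -1)

# What was intended: skip 1 character, keep everything except both ends.
fixed = sql_substring(value, 2, len(value) - 2)

print(repr(broken))
print(repr(fixed))
print(fixed == value[1:-1])
```

So the equivalent of the pandas `.str[1:-1]` is pos = 2 and len = (string length - 2), which means the length has to be computed per row rather than passed as a constant.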

Try this (with the syntax fixed):

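from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# Python slicing drops the first and last character; nulls are passed through unchanged
udf1 = udf(lambda x: x[1:-1] if x is not None else None, StringType())
df.withColumn('COLUMN_NAME_fix', udf1('COLUMN_NAME')).show()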
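If you would rather avoid a Python UDF (rows have to be shipped to Python workers and back), the same trim can probably be expressed with Spark's built-in substring and length functions, since the length can be computed per row inside a SQL expression. Below is a minimal sketch that only builds the expression string; `trim_ends_expr` is a hypothetical helper name, not part of PySpark, and the commented usage assumes the `df` from the question.

```python
def trim_ends_expr(column, n_start=1, n_end=1):
    """Return a Spark SQL expression string that drops n_start leading
    and n_end trailing characters from `column`.

    substring() is 1-based, so the kept part starts at n_start + 1 and
    its length is the full length minus the characters dropped.
    """
    if n_start < 0 or n_end < 0:
        raise ValueError("n_start and n_end must be non-negative")
    return (
        f"substring(`{column}`, {n_start + 1}, "
        f"length(`{column}`) - {n_start + n_end})"
    )


sql = trim_ends_expr("COLUMN_NAME")
print(sql)

# Intended use with the DataFrame from the question (requires PySpark):
#   from pyspark.sql.functions import expr
#   df.withColumn("COLUMN_NAME_fix", expr(trim_ends_expr("COLUMN_NAME"))).show()
```

For '_string_' this asks for 6 characters starting at position 2, i.e. 'string'. Values shorter than the number of characters being removed end up with a non-positive length, so check how you want those rows handled.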
