Pyspark alter column with substring

Question

PySpark n00b... How do I replace a column with a substring of itself? I'm trying to remove a fixed number of characters from the start and end of a string.

from pyspark.sql.functions import substring
import pandas as pd
pdf = pd.DataFrame({'COLUMN_NAME':['_string_','_another string_']})
# this is what i'm looking for...
pdf['COLUMN_NAME_fix']=pdf['COLUMN_NAME'].str[1:-1] 

df = sqlContext.createDataFrame(pdf)
# following not working... COLUMN_NAME_fix is blank
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1)).show() 

This is pretty close but slightly different: "Spark Dataframe column with last character of other column". And then there is "LEFT and RIGHT function in PySpark SQL".

Recommended answer

pyspark.sql.functions.substring(str, pos, len)

Substring starts at pos and is of length len when str is String type, or returns the slice of the byte array that starts at pos (in bytes) and is of length len when str is Binary type.

In your code,

df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1))
Here 1 is pos and -1 is len. A negative length is not meaningful, so no substring is returned and COLUMN_NAME_fix comes out blank.
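To make the pos/len semantics concrete, here is a rough pure-Python model of the call. The function `substring_model` is hypothetical and is not part of PySpark. It assumes pos is 1-based and at least 1, and it treats a non-positive len as selecting nothing, which matches the blank column seen in the question.

```python
def substring_model(s, pos, length):
    """Simplified stand-in for substring(str, pos, len) on string input.

    Assumptions (not taken from the Spark source): pos >= 1 and 1-based,
    and a non-positive length selects no characters.
    """
    if length <= 0:
        return ""
    start = pos - 1  # convert the 1-based pos to a 0-based Python index
    return s[start:start + length]


value = "_string_"

# Mirrors the failing call substring('COLUMN_NAME', 1, -1): nothing is selected.
print(repr(substring_model(value, 1, -1)))

# Equivalent of the pandas slice .str[1:-1]: start at position 2 and
# keep len(value) - 2 characters.
print(repr(substring_model(value, 2, len(value) - 2)))
```

The point is that Python's `[1:-1]` is a (start, end) pair, while substring takes (pos, len), so the -1 has to be turned into a length computed from the string itself.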

Try this (with fixed syntax):

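from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# Python slicing drops the first and last character; nulls are passed through unchanged
udf1 = udf(lambda x: x[1:-1] if x is not None else None, StringType())
df.withColumn('COLUMN_NAME_fix', udf1('COLUMN_NAME')).show()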
