Pyspark alter column with substring
Question
Pyspark n00b... How do I replace a column with a substring of itself? I'm trying to remove a set number of characters from the start and end of a string.
from pyspark.sql.functions import substring
import pandas as pd
pdf = pd.DataFrame({'COLUMN_NAME':['_string_','_another string_']})
# this is what i'm looking for...
pdf['COLUMN_NAME_fix']=pdf['COLUMN_NAME'].str[1:-1]
df = sqlContext.createDataFrame(pdf)
# following not working... COLUMN_NAME_fix is blank
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1)).show()
This is pretty close but slightly different: "Spark Dataframe column with last character of other column". And then there is this: "LEFT and RIGHT function in PySpark SQL".
Recommended answer
pyspark.sql.functions.substring(str, pos, len)
The substring starts at pos and has length len when str is of String type; when str is of Binary type, it returns the slice of the byte array that starts at pos (in bytes) and has length len.
In your code,
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1))
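1 is pos and -1 becomes len. Unlike a Python slice, the third argument is a length, not an end index, and a length can't be negative, so the call can't return the trimmed string. The result comes back empty, which is why COLUMN_NAME_fix looks blank.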
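To make the difference from Python slicing concrete, here is a rough pure-Python model of how substring(str, pos, len) treats its arguments. The helper name sql_substring is hypothetical, it only covers pos >= 1, and it assumes that a non-positive len produces an empty result (which matches the blank column reported in the question); treat it as a sketch, not as Spark's actual implementation.

```python
def sql_substring(s, pos, length):
    """Rough model of substring(str, pos, len) for pos >= 1 (hypothetical helper).

    pos is 1-based and length is a character count, not an end index.
    Assumption: a non-positive length yields an empty string.
    """
    if pos < 1:
        raise ValueError("this sketch only models pos >= 1")
    if length <= 0:
        return ""
    start = pos - 1  # convert the 1-based position to a 0-based index
    return s[start:start + length]


value = "_string_"

# What the question's call amounts to: pos=1, len=-1.
blank = sql_substring(value, 1, -1)

# Equivalent of the Python slice value[1:-1]: start at position 2
# and keep len(value) - 2 characters.
trimmed = sql_substring(value, 2, len(value) - 2)

print(repr(blank), repr(trimmed))
```

The takeaway is that dropping one character from each end means starting at position 2 and asking for two fewer characters than the string holds.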
Try this (with fixed syntax):
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
udf1 = udf(lambda x:x[1:-1],StringType())
df.withColumn('COLUMN_NAME_fix',udf1('COLUMN_NAME')).show()
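As a possible alternative to the UDF, the same trimming can probably be expressed with Spark's built-in SQL functions, computing the length from the column itself. This is a minimal sketch: the helper names trim_expr and add_trimmed_column are made up for illustration, and it assumes the column holds strings. One difference to be aware of is that a null value should pass through as null here, whereas the lambda above would fail on None.

```python
def trim_expr(col_name, n_start=1, n_end=1):
    """Build a Spark SQL expression string that drops n_start leading and
    n_end trailing characters from col_name (hypothetical helper)."""
    # substring is 1-based, so skipping n_start characters means
    # starting at position n_start + 1.
    return (
        f"substring(`{col_name}`, {n_start + 1}, "
        f"length(`{col_name}`) - {n_start + n_end})"
    )


def add_trimmed_column(df, col_name, new_name, n_start=1, n_end=1):
    """Return df with new_name holding the trimmed version of col_name."""
    # Imported here so trim_expr can be used without a Spark installation.
    from pyspark.sql.functions import expr

    return df.withColumn(new_name, expr(trim_expr(col_name, n_start, n_end)))


# The expression that would be evaluated for the question's column:
print(trim_expr("COLUMN_NAME"))
```

With the DataFrame from the question, the call would look like add_trimmed_column(df, 'COLUMN_NAME', 'COLUMN_NAME_fix').show(). Built-in expressions usually avoid the Python serialization overhead that a UDF incurs, which is the main reason to prefer this form when it fits.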