Pyspark alter column with substring
Problem description
Pyspark n00b... How do I replace a column with a substring of itself? I'm trying to remove a set number of characters from the start and end of a string.
from pyspark.sql.functions import substring
import pandas as pd
pdf = pd.DataFrame({'COLUMN_NAME':['_string_','_another string_']})
# this is what i'm looking for...
pdf['COLUMN_NAME_fix']=pdf['COLUMN_NAME'].str[1:-1]
df = sqlContext.createDataFrame(pdf)
# following not working... COLUMN_NAME_fix is blank
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1)).show()
This is pretty close but slightly different: "Spark Dataframe column with last character of other column". And then there is this: "LEFT and RIGHT function in PySpark SQL".
Recommended answer
pyspark.sql.functions.substring(str, pos, len)
When str is a String, the substring starts at pos and has length len. When str is Binary, the function returns the slice of the byte array that starts at byte pos and has length len.
In your code,
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1))
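1 is pos and -1 becomes len. Unlike a Python slice, the third argument is a length, not an end index, and a negative length cannot select any characters, so the result is an empty string (the blank column you are seeing).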
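To make the pos/len arithmetic concrete, here is a rough plain-Python model of how substring behaves on strings. The function name substring_model is hypothetical, the sketch only covers pos >= 1, and it is an illustration of the semantics described above, not Spark's actual implementation.

```python
def substring_model(s, pos, length):
    """Hypothetical plain-Python model of substring(str, pos, len) for pos >= 1."""
    if pos < 1:
        raise ValueError("this sketch only models pos >= 1")
    start = pos - 1        # convert the 1-based pos to a 0-based index
    end = start + length   # exclusive end index
    if end <= start:       # a zero or negative length selects nothing
        return ""
    return s[start:end]


value = "_string_"
# The call from the question: pos=1, len=-1
print(repr(substring_model(value, 1, -1)))
# Skip the first character (pos=2) and keep len(value) - 2 characters
print(repr(substring_model(value, 2, len(value) - 2)))
```

The second call shows the arithmetic that does what the question wants: start at position 2 and take the string length minus 2 characters.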
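Try this (with fixed syntax):
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
# The Python slice [1:-1] drops the first and last character; nulls are passed through unchanged
udf1 = udf(lambda x: x[1:-1] if x is not None else None, StringType())
df.withColumn('COLUMN_NAME_fix', udf1('COLUMN_NAME')).show()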
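As a possible alternative to a Python UDF, the same trimming can be written with the built-in substring and length functions through a SQL expression. The sketch below is one way to do that. The helper names trim_ends_expr and add_fixed_column are hypothetical, and it assumes a simple column name that needs no backtick quoting.

```python
def trim_ends_expr(column, n_start=1, n_end=1):
    """Build a Spark SQL expression string that drops n_start leading
    and n_end trailing characters from `column` (hypothetical helper)."""
    # substring is 1-based, so skipping n_start characters means starting at n_start + 1
    return "substring({c}, {pos}, length({c}) - {drop})".format(
        c=column, pos=n_start + 1, drop=n_start + n_end)


def add_fixed_column(df, column, new_column, n_start=1, n_end=1):
    """Return df with new_column holding the trimmed value of column."""
    # Imported here so the expression builder above works without a Spark install
    from pyspark.sql.functions import expr
    return df.withColumn(new_column, expr(trim_ends_expr(column, n_start, n_end)))


# The expression generated for the column in the question
print(trim_ends_expr('COLUMN_NAME'))

# Usage with the df from the question:
# add_fixed_column(df, 'COLUMN_NAME', 'COLUMN_NAME_fix').show()
```

Because the expression is evaluated inside Spark, rows do not have to be passed through a Python worker, which is usually the reason to prefer built-in functions over a UDF when both can express the logic.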