Remove the last few characters in a PySpark DataFrame column

Problem description

I have a PySpark DataFrame. How can I remove the last 5 characters from each value in the name column below?

from pyspark.sql.functions import substring, length, col
valuesCol = [('rose_2012',),('jasmine_2013',),('lily_2014',),('daffodil_2017',),('sunflower_2016',)]
df = sqlContext.createDataFrame(valuesCol,['name'])
df.show()

+--------------+
|          name|
+--------------+
|     rose_2012|
|  jasmine_2013|
|     lily_2014|
| daffodil_2017|
|sunflower_2016|
+--------------+

I want to create two columns, flower and year.

Expected output:

+--------------+----+---------+
|          name|year|   flower|
+--------------+----+---------+
|     rose_2012|2012|     rose|
|  jasmine_2013|2013|  jasmine|
|     lily_2014|2014|     lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+

I have already created the year column:

df = df.withColumn("year", substring(col("name"),-4,4))
df.show()
+--------------+----+
|          name|year|
+--------------+----+
|     rose_2012|2012|
|  jasmine_2013|2013|
|     lily_2014|2014|
| daffodil_2017|2017|
|sunflower_2016|2016|
+--------------+----+
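The position arithmetic behind substring can be illustrated without Spark. The sketch below is a simplified plain-Python model of how SQL-style substring treats its 1-based (or negative) start position; the function name sql_substring is hypothetical and this is not Spark's own implementation, only an approximation for reasoning about the indices used in this question.

```python
def sql_substring(s, pos, length):
    """Simplified model of SQL-style substring (an assumption, not Spark's code).

    pos is 1-based; a negative pos counts back from the end of the string;
    pos == 0 is treated like 1.
    """
    n = len(s)
    if pos > 0:
        start = pos - 1          # convert 1-based position to 0-based index
    elif pos < 0:
        start = n + pos          # count back from the end
    else:
        start = 0
    end = start + length
    start = max(start, 0)        # clamp to the string bounds
    end = min(end, n)
    return s[start:end] if start < end else ""


name = "rose_2012"
year = sql_substring(name, -4, 4)                # last 4 characters
flower = sql_substring(name, 1, len(name) - 5)   # everything except the last 5
print(year, flower)
```

In this model the year comes from starting 4 characters before the end, while the flower name needs a length that depends on each string, which is the part that has to be computed per row.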

I don't know how to remove the last 5 characters so that only the flower name remains. I tried the following, using length to compute the substring length, but it doesn't work.

df = df.withColumn("flower",substring(col("name"),0,length(col("name"))-5))

How can I create a flower column that contains only the flower names?

Solution

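You can use the expr function. It hands the whole expression to Spark SQL, so length(name)-5 is evaluated per row. In PySpark versions where functions.substring accepts only plain integers for the position and length, passing a Column such as length(col("name"))-5 is what makes the attempt above fail. Note that the SQL substring position is 1-based.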

>>> from pyspark.sql.functions import substring, length, col, expr
>>> df = df.withColumn("flower",expr("substring(name, 1, length(name)-5)"))
>>> df.show()
+--------------+----+---------+
|          name|year|   flower|
+--------------+----+---------+
|     rose_2012|2012|     rose|
|  jasmine_2013|2013|  jasmine|
|     lily_2014|2014|     lily|
| daffodil_2017|2017| daffodil|
|sunflower_2016|2016|sunflower|
+--------------+----+---------+
