Remove last pipe-delimited value from dataframe column in pyspark


Problem Description

I am using Spark 2.1 and have a dataframe column containing values like AB|12|XY|4. I want to create a new column by removing the last element, so it should show like AB|12|XY.

I tried split and rsplit, but they did not work, so I need some suggestions to get the desired output.
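For reference, the desired logic is straightforward on a plain Python string; the catch is that a pyspark Column is not a string, so string methods like rsplit cannot be called on it directly. This sketch shows the plain-string case (my reading of why rsplit "did not work" here; the question does not say exactly):

```python
# str.rsplit splits from the right; maxsplit=1 separates only the last element.
# This works on an ordinary Python string, but a pyspark Column has no
# rsplit method, so the same call fails on a dataframe column.
s = 'AB|12|XY|4'
print(s.rsplit('|', 1)[0])  # AB|12|XY
```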

Answer

Use the Spark SQL split function as follows:

>>> from pyspark.sql.functions import split
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df        = spark.read.json(sc.parallelize(json_data))
>>> df.show()
+------------------+                   
|                c1|                  
+------------------+                  
|        AB|12|XY|4|                  
|11|22|33|44|remove|                  
+------------------+                  

>>> df2 = df.withColumn("c2", split(df.c1, r'\|\w+$')[0])  # split takes a regex pattern
>>> df2.show()
+------------------+-----------+
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+ 
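The pattern can be checked outside Spark with Python's re module (a sketch; Spark compiles Java regexes, but for a simple pattern like this the behavior matches). Note that `\w+` only matches word characters, so a last element containing, say, a hyphen would not be stripped; `[^|]+$` would be a more permissive pattern.

```python
import re

# The same pattern passed to split() above: a literal pipe followed by
# one or more word characters, anchored to the end of the string.
pattern = r'\|\w+$'

# Splitting on the pattern leaves everything before the final element
# in the first piece of the result.
print(re.split(pattern, 'AB|12|XY|4')[0])          # AB|12|XY
print(re.split(pattern, '11|22|33|44|remove')[0])  # 11|22|33|44
```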

If you need to do something more complicated that can't be implemented using the built-in functions, you can define your own user-defined function (UDF):

>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>> def my_func(s):          # avoid naming the parameter "str" (shadows the builtin)
...   return s.rsplit('|', 1)[0]
...
>>> my_udf    = udf(my_func, StringType())
>>> json_data = ['{"c1":"AB|12|XY|4"}','{"c1":"11|22|33|44|remove"}']
>>> df        = spark.read.json(sc.parallelize(json_data))

>>> df2 = df.withColumn("c2", my_udf(df.c1))
>>> df2.show()
+------------------+-----------+ 
|                c1|         c2|
+------------------+-----------+
|        AB|12|XY|4|   AB|12|XY|
|11|22|33|44|remove|11|22|33|44|
+------------------+-----------+

Built-in SQL functions are preferred because your data does not get passed back and forth between the JVM process and the Python process, which is what happens when you use a UDF.
