Pyspark - Split a column and take n elements


I want to take a column and split its string using a character. As usual, I understood that the split method returns a list, but when coding I found that the returned object only has the methods getItem and getField, with the following descriptions from the API:

@since(1.3)
def getItem(self, key):
    """
    An expression that gets an item at position ``ordinal`` out of a list,
    or gets an item by key out of a dict.
    """

@since(1.3)
def getField(self, name):
    """
    An expression that gets a field by name in a StructField.
    """

Obviously this doesn't meet my requirements; for example, for the text "A_B_C_D" within the column, I would like to split it into "A_B_C_" and "D" in two different columns.

This is the code I'm using:

from pyspark.sql.functions import regexp_extract, col, split

df_test = spark.sql("SELECT * FROM db_test.table_test")

# Applying the transformations to the data
split_col = split(df_test['Full_text'], '_')
df_split = df_test.withColumn('Last_Item', split_col.getItem(3))

Here is an example:

from pyspark.sql import Row
from pyspark.sql.functions import regexp_extract, col, split
l = [("Item1_Item2_ItemN"),("FirstItem_SecondItem_LastItem"),("ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn")]
rdd = sc.parallelize(l)
datax = rdd.map(lambda x: Row(fullString=x))
df = sqlContext.createDataFrame(datax)
split_col=split(df['fullString'],'_')
df=df.withColumn('LastItemOfSplit',split_col.getItem(2))

Result:

fullString                                                LastItemOfSplit
Item1_Item2_ItemN                                            ItemN
FirstItem_SecondItem_LastItem                                LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn     null

My expected result would be to always get the last item:

fullString                                                LastItemOfSplit
Item1_Item2_ItemN                                            ItemN
FirstItem_SecondItem_LastItem                                LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn  ThisShouldBeInTheLastColumn

Solution

You can use getItem(size - 1) to get the last item from the arrays:

Example:

df = spark.createDataFrame([[['A', 'B', 'C', 'D']], [['E', 'F']]], ['split'])
df.show()
+------------+
|       split|
+------------+
|[A, B, C, D]|
|      [E, F]|
+------------+

import pyspark.sql.functions as F
df.withColumn('lastItem', df.split.getItem(F.size(df.split) - 1)).show()
+------------+--------+
|       split|lastItem|
+------------+--------+
|[A, B, C, D]|       D|
|      [E, F]|       F|
+------------+--------+

For your case:

from pyspark.sql.functions import regexp_extract, col, split, size

df_test = spark.sql("SELECT * FROM db_test.table_test")

# Applying the transformations to the data
split_col = split(df_test['Full_text'], '_')
df_split = df_test.withColumn('Last_Item', split_col.getItem(size(split_col) - 1))
