Tokenizing and ranking a string column into multiple columns in PySpark


Problem description

I have a PySpark dataframe that has a string column which contains a comma separated, unsorted list of values (up to 5 values), like this:

+----+----------------------+
|col1|col2                  |
+----+----------------------+
|1   | 'b1, a1, c1'         |
|2   | 'a2, b2'             |
|3   | 'e3, d3, a3, c3, b3' |
+----+----------------------+

I want to tokenize col2 and then rank the tokens based on a criterion, creating 5 new columns out of col2, possibly with null values if the tokenization returns fewer than 5 values. The ranking is simple: if the token is in set1, put it in the first new column (col3); else if it is in set2, put it in the second new column (col4); and so on.

Let's say:

set1 = ['a1', 'a2', 'a3', 'a4', 'a5'], 
set2 = ['b1', 'b2', 'b3', 'b4', 'b5'], 
set3 = ['c1', 'c2', 'c3', 'c4', 'c5'], 
set4 = ['d1', 'd2', 'd3', 'd4', 'd5'], 
set5 = ['e1', 'e2', 'e3', 'e4', 'e5']
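The routing rule can be sketched in plain Python before involving Spark. The `rank_tokens` helper below is hypothetical, not part of the question; it only illustrates how each token maps to a fixed slot by set membership:

```python
set1 = {'a1', 'a2', 'a3', 'a4', 'a5'}
set2 = {'b1', 'b2', 'b3', 'b4', 'b5'}
set3 = {'c1', 'c2', 'c3', 'c4', 'c5'}
set4 = {'d1', 'd2', 'd3', 'd4', 'd5'}
set5 = {'e1', 'e2', 'e3', 'e4', 'e5'}

def rank_tokens(value):
    """Route each comma-separated token to its slot; None fills the gaps."""
    slots = [None] * 5
    for token in value.split(','):
        token = token.strip()  # the values are separated by ', '
        for i, s in enumerate([set1, set2, set3, set4, set5]):
            if token in s:
                slots[i] = token
                break
    return slots

# rank_tokens('b1, a1, c1') -> ['a1', 'b1', 'c1', None, None]
```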

Then applying the change on the dataframe above will result in the following dataframe:

+----+----+----+----+----+----+
|col1|col3|col4|col5|col6|col7|
+----+----+----+----+----+----+
|1   |'a1'|'b1'|'c1'|null|null|
|2   |'a2'|'b2'|null|null|null|
|3   |'a3'|'b3'|'c3'|'d3'|'e3'|
+----+----+----+----+----+----+

I know how to do tokenization:

from pyspark.sql.functions import col, split

df.withColumn('col2', split('col2', ', ')) \
  .select(col('col1'), *[col('col2')[i].alias('col' + str(i + 3)) for i in range(0, 5)]) \
  .show()

but can't figure out how to perform ranking before creating the new columns. Any help would be much appreciated.

Recommended answer

I found a solution for this. We can use a udf that sorts the list of strings in that column based on the sets. Then apply the tokenization on top of the udf function and create different columns from it.

set1 = {'a1', 'a2', 'a3', 'a4', 'a5'}
set2 = {'b1', 'b2', 'b3', 'b4', 'b5'}
set3 = {'c1', 'c2', 'c3', 'c4', 'c5'}
set4 = {'d1', 'd2', 'd3', 'd4', 'd5'}
set5 = {'e1', 'e2', 'e3', 'e4', 'e5'}

def sortCategories(x):
    # ','.join cannot hold nulls, so 'unknown' marks the empty slots here.
    resultArray = ['unknown' for i in range(5)]
    tokens = x.split(',')
    for token in tokens:
        # Strip the space left over after each comma ('b1, a1' splits
        # into 'b1' and ' a1'); without this, no token matches any set.
        token = token.strip()
        if token in set1:
            resultArray[0] = token
        elif token in set2:
            resultArray[1] = token
        elif token in set3:
            resultArray[2] = token
        elif token in set4:
            resultArray[3] = token
        elif token in set5:
            resultArray[4] = token
    return ','.join(resultArray)

from pyspark.sql.functions import col, split, udf
from pyspark.sql.types import StringType

orderUdfString = udf(sortCategories, StringType())
df = df.withColumn('col2', orderUdfString('col2'))
# Alias as col3..col7 so the generated names match the expected output
# and do not collide with the existing col1.
df = df.withColumn('col_temp', split('col2', ',')) \
  .select([col(c) for c in df.columns] + [col('col_temp')[i].alias('col' + str(i + 3)) for i in range(0, 5)])
