PySpark explode list into multiple columns based on name
Question
Hi, I'm dealing with a slightly difficult file format that I'm trying to clean up for future processing. I've been using PySpark to process the data into a DataFrame.
The file looks like this:
AA 1234 ZXYW
BB A 890
CC B 321
AA 1234 LMNO
BB D 123
CC E 321
AA 1234 ZXYW
CC E 456
Each 'AA' record defines the start of a logical group of records. The data on each line is fixed length and has information encoded in it that I want to extract. There are at least 20-30 different record types, always identified by a two-letter code at the start of each line. Each group can contain one or many different record types (i.e. not all record types are present in every group).
As a first stage, I've managed to group the records together in this format:
+----------------+---------------------------------+
| index| result|
+----------------+---------------------------------+
| 1|[AA 1234 ZXYV,BB A 890,CC B 321]|
| 2|[AA 1234 LMNO,BB D 123,CC E 321]|
| 3|[AA 1234 ZXYV,CC B 321] |
+----------------+---------------------------------+
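The grouping rule described above can be illustrated with a minimal plain-Python sketch. This is not the PySpark code used to produce the table; `group_records` is a hypothetical helper, and it assumes every group starts with a line beginning with 'AA'.

```python
def group_records(lines):
    """Group lines so that each 'AA' record starts a new group."""
    groups = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("AA") or not groups:
            groups.append([line])      # an 'AA' record opens a new group
        else:
            groups[-1].append(line)    # any other record joins the current group
    # Number the groups from 1, mirroring the index column in the table
    return list(enumerate(groups, start=1))


sample = [
    "AA 1234 ZXYW", "BB A 890", "CC B 321",
    "AA 1234 LMNO", "BB D 123", "CC E 321",
    "AA 1234 ZXYW", "CC E 456",
]
grouped = group_records(sample)
```

Each element of `grouped` is an `(index, [records])` pair, which is the same shape as the rows of the first-stage table.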
As a second stage, I want to get the data into the following columns in a DataFrame:
+----------------+---------------------------------+-------------+--------+--------+
| index| result| AA| BB| CC|
+----------------+---------------------------------+-------------+--------+--------+
| 1|[AA 1234 ZXYV,BB A 890,CC B 321]|AA 1234 ZXYV|BB A 890|CC B 321|
| 2|[AA 1234 LMNO,BB D 123,CC E 321]|AA 1234 LMNO|BB D 123|CC E 321|
| 3|[AA 1234 ZXYV,CC B 321] |AA 1234 ZXYV| Null|CC B 321|
+----------------+---------------------------------+-------------+--------+--------+
Because at that point extracting the information that I need should be trivial.
Does anyone have any suggestions as to how I might be able to do this?
Many thanks.
Recommended answer
You can use flatMap and pivot to achieve this. Starting from the results of the first stage:
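from pyspark.sql import SparkSession
# In the pyspark shell `spark` and `sc` already exist; otherwise create them
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([(1, ['AA 1234 ZXYV', 'BB A 890', 'CC B 321']),
                      (2, ['AA 1234 LMNO', 'BB D 123', 'CC E 321']),
                      (3, ['AA 1234 ZXYV', 'CC B 321'])])
df = rdd.toDF(['index', 'result'])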
You can first explode the array into multiple rows using flatMap and extract the two-letter identifier into a separate column:
# Emit one row per record: (index, full record, two-letter code, rest of the record)
df_flattened = df.rdd.flatMap(lambda x: [(x[0], y, y[0:2], y[3:]) for y in x[1]])\
                     .toDF(['index', 'result', 'identifier', 'identifiertype'])
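To see what the flatMap step emits, here is a small plain-Python sketch of the same per-row logic, run on one row without Spark. The name `flatten_row` is made up for illustration; the slicing matches the lambda above (the first two characters are the identifier, everything after the separating space is the remainder).

```python
def flatten_row(row):
    """Turn (index, [records]) into one tuple per record."""
    index, records = row
    # rec[0:2] is the two-letter code, rec[3:] skips the code and the space after it
    return [(index, rec, rec[0:2], rec[3:]) for rec in records]


rows = flatten_row((3, ['AA 1234 ZXYV', 'CC B 321']))
for r in rows:
    print(r)
```

flatMap applies this kind of function to every row and concatenates the resulting lists, so a group with three records becomes three rows.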
Then use pivot to turn the two-letter identifiers into column names:
from pyspark.sql.functions import first
df_result = df_flattened.groupBy('index')\
                        .pivot('identifier')\
                        .agg(first('identifiertype'))\
                        .join(df, 'index')
I added the join to get the result column back.
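As a rough mental model of the groupBy/pivot step, the plain-Python sketch below builds one dictionary per index with one key per identifier. It is only an illustration of the reshaping, not how Spark implements it; `pivot_rows` is a hypothetical helper, and keeping the first value seen is only similar in spirit to `first()`. Note that with the slicing used above, the pivoted cells hold the text after the identifier (for example '1234 ZXYV'), not the full record.

```python
def pivot_rows(flat_rows, identifiers):
    """Reshape (index, record, identifier, value) tuples into one dict per index."""
    result = {}
    for index, _record, identifier, value in flat_rows:
        # Start every index with all identifier columns set to None
        row = result.setdefault(index, {name: None for name in identifiers})
        if row[identifier] is None:
            row[identifier] = value  # keep the first value seen for this cell
    return result


flat_rows = [
    (1, 'AA 1234 ZXYV', 'AA', '1234 ZXYV'),
    (1, 'BB A 890', 'BB', 'A 890'),
    (1, 'CC B 321', 'CC', 'B 321'),
    (3, 'AA 1234 ZXYV', 'AA', '1234 ZXYV'),
    (3, 'CC B 321', 'CC', 'B 321'),
]
pivoted = pivot_rows(flat_rows, ['AA', 'BB', 'CC'])
```

Group 3 has no 'BB' record, so its 'BB' entry stays `None`, which corresponds to the null cell in the desired output.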