Iteratively get the max of a data frame column, add one and repeat for all rows in spark/sql
Question
I'm using Spark and mainly interacting with it via SQL (I'm a novice to both, so please bear with me). I have a data set that looks like this in a table called `taxonomies`:
| level_4 | level_5 | level_6 | level_7 | system_id | node_id |
|---|---|---|---|---|---|
| American League | West | null | null | 4 | 633 |
| American League | East | null | null | 4 | 634 |
| National League | West | Dodgers | bellinger | 3 | 635 |
| National League | Central | Cardinals | null | null | null |
| American League | Central | null | null | null | null |
I'm going to get the system_id column by left joining another table; what I want to do here is assign a new node_id to the null rows by adding one to the max of the node_id column.
I adapted this query from an answer to a previous question:
```sql
WITH cte AS (
  SELECT max(node_id) AS maxid FROM taxonomies
)
SELECT cte.maxid + row_number() OVER (ORDER BY node_id) AS node_id,
       system_id, `level_4`, `level_5`, `level_6`, `level_7`
FROM taxonomies
LEFT JOIN cte WHERE taxonomies.node_id IS NULL
UNION
SELECT * FROM taxonomies WHERE node_id IS NOT NULL
```
What I don't understand is that this appropriately creates new node_id values for the null rows, but for everything else it swaps system_id and node_id. I've tried switching node_id and system_id after AS, but all that does is change the column names, with the same result. What I want is this:
| level_4 | level_5 | level_6 | level_7 | system_id | node_id |
|---|---|---|---|---|---|
| American League | West | null | null | 4 | 633 |
| American League | East | null | null | 4 | 634 |
| National League | West | Dodgers | bellinger | 3 | 635 |
| National League | Central | Cardinals | null | null | 636 |
| American League | Central | null | null | null | 637 |
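For context on the symptom above: `UNION` matches columns by position, not by name, and the query's first branch lists node_id first while `SELECT *` returns it last. A minimal sqlite3 sketch of that positional behavior (illustrative labels, not the original Spark setup):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNION ALL lines values up by position, not by column name, so branches
# that list the same columns in different orders silently mix their values.
rows = conn.execute("""
    SELECT 'id-633' AS node_id, 'sys-4' AS system_id
    UNION ALL
    SELECT 'sys-4', 'id-633'   -- same columns, reversed order: values land swapped
""").fetchall()
print(rows)  # the second row's node_id column holds the system value
```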
Any help is much appreciated!
Answer
A solution to your problem can be this:
```sql
SELECT
  system_id,
  (min(node_id) OVER ()) - 1 + row_number() OVER (ORDER BY -node_id DESC) AS node_id,
  `level_4`, `level_5`, `level_6`, `level_7`
FROM taxonomies
```
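One way to sanity-check this pattern outside Spark is SQLite, whose window functions behave the same way on this data; a runnable sketch of the query against the sample rows, not the asker's actual environment:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxonomies (
    level_4 TEXT, level_5 TEXT, level_6 TEXT, level_7 TEXT,
    system_id INTEGER, node_id INTEGER
);
INSERT INTO taxonomies VALUES
    ('American League', 'West',    NULL,        NULL,        4,    633),
    ('American League', 'East',    NULL,        NULL,        4,    634),
    ('National League', 'West',    'Dodgers',   'bellinger', 3,    635),
    ('National League', 'Central', 'Cardinals', NULL,        NULL, NULL),
    ('American League', 'Central', NULL,        NULL,        NULL, NULL);
""")

# min(node_id) - 1 + row_number() renumbers every row: the existing ids are
# reproduced (they are consecutive) and the null rows continue the sequence,
# because ORDER BY -node_id DESC sorts the nulls last.
rows = conn.execute("""
    SELECT level_5,
           min(node_id) OVER () - 1
             + row_number() OVER (ORDER BY -node_id DESC) AS node_id
    FROM taxonomies
""").fetchall()
print(sorted(r[1] for r in rows))  # [633, 634, 635, 636, 637]
```

Note that this only reproduces the existing ids because they happen to be consecutive; with gaps in node_id, the renumbering would shift them.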
If you want to do this using a join, this is the way:
```sql
SELECT
  system_id,
  t2.min_node - 1 + row_number() OVER (ORDER BY -tn.node_id DESC) AS node_id,
  `level_4`, `level_5`, `level_6`, `level_7`
FROM taxonomies tn
CROSS JOIN (SELECT min(node_id) AS min_node FROM taxonomies) AS t2
```
If you want to do this using a sub-query, this is the way:
```sql
SELECT
  system_id,
  (SELECT min(node_id) FROM taxonomies) - 1 + row_number() OVER (ORDER BY -node_id DESC) AS node_id,
  `level_4`, `level_5`, `level_6`, `level_7`
FROM taxonomies
```
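Outside SQL, the question's original intent (take max(node_id), add one, and keep counting for each null row) can be sketched in plain Python over hypothetical in-memory rows; the names are illustrative, not Spark API calls:

```python
# Hypothetical in-memory stand-in for the taxonomies rows
rows = [
    {"level_5": "West",    "node_id": 633},
    {"level_5": "East",    "node_id": 634},
    {"level_5": "West",    "node_id": 635},
    {"level_5": "Central", "node_id": None},
    {"level_5": "Central", "node_id": None},
]

# Start one past the current maximum, then hand out consecutive new ids
next_id = max(r["node_id"] for r in rows if r["node_id"] is not None) + 1
for r in rows:
    if r["node_id"] is None:
        r["node_id"] = next_id
        next_id += 1

print([r["node_id"] for r in rows])  # [633, 634, 635, 636, 637]
```

Unlike the min-based renumbering above, this leaves existing ids untouched, so it also works when they are not consecutive.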