Add column from one dataframe to another WITHOUT JOIN


Problem Description

Referring to here, which recommends a join to append a column from one table to another. I have indeed been using this method, but I am now reaching its limits for a huge number of tables and rows.

Let's say I have a dataframe of M features: id, salary, age, etc.

+-----+--------+-----+------+
| id  | salary | age | zone | ....
+-----+--------+-----+------+

I have performed certain operations on each feature to arrive at something like this:

+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_salary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+

Each feature is processed independently, over the same list of rows:

+-----+--------+--------------+--------------+--------------+
| id  | salary | stat1_salary | stat2_salary | stat3_salary |
+-----+--------+--------------+--------------+--------------+
| 301 | x1     | x            | x            | x            |
| 302 | null   | x            | x            | x            |
| 303 | x3     | x            | x            | x            |
+-----+--------+--------------+--------------+--------------+

+-----+------+-----------+-----------+
| id  | age  | stat1_age | stat2_age |
+-----+------+-----------+-----------+
| 301 | null | x         | x         |
| 302 | x2   | x         | x         |
| 303 | x3   | x         | x         |
+-----+------+-----------+-----------+

In the end, I would like to combine them into a final dataframe with all attributes of each feature, by joining on the unique id across effectively hundreds to thousands of tables, one per feature. This final dataframe is my feature vector.

| id | salary | stat1_salary | stat2_salary | stat3_salary | age | stat1_age | stat2_age | ...
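For reference, the join-based combination described above corresponds to something like the following. This is a local pandas sketch with hypothetical data; in PySpark each `merge` would be a `DataFrame.join`, repeated once per feature table:

```python
from functools import reduce
import pandas as pd

# Hypothetical per-feature tables, one per feature, all keyed by id
tables = [
    pd.DataFrame({"id": [301, 302, 303], "salary": [10, 20, 30]}),
    pd.DataFrame({"id": [301, 302, 303], "age": [40, 50, 60]}),
    pd.DataFrame({"id": [301, 302, 303], "zone": [1, 2, 3]}),
]

# The join-based combination: one merge (join on id) per feature table
feature_vector = reduce(lambda a, b: a.merge(b, on="id"), tables)
# feature_vector columns: id, salary, age, zone
```

With hundreds to thousands of tables, this chain of joins is exactly the part that becomes expensive in Spark.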

I am hitting a memory limit that causes an Out Of Memory exception. Raising executor and driver memory seems to be only a temporary solution, and is limited by the admin.

JOIN is expensive and limited by resources in pyspark, so I wonder whether it is possible to pre-sort each feature table independently, then keep that order and simply APPEND the entire columns next to one another instead of performing an expensive JOIN. I can manage to keep the same list of rows for every feature table. I hope to avoid any join or lookup, because my set of ids is identical across tables.
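As a local pandas sketch (not Spark, and with hypothetical data), the sort-then-append the question has in mind would look like this:

```python
import pandas as pd

# Two per-feature tables sharing the same set of ids (hypothetical data)
salary = pd.DataFrame({"id": [302, 301, 303], "salary": [20, 10, 30]})
age    = pd.DataFrame({"id": [301, 303, 302], "age": [40, 60, 50]})

# Sort each table by id, then place columns side by side -- no join, no lookup
salary = salary.sort_values("id").reset_index(drop=True)
age    = age.sort_values("id").reset_index(drop=True)
combined = pd.concat([salary, age.drop(columns="id")], axis=1)
# combined has columns id, salary, age, aligned purely by row order
```

The catch, as the next paragraph notes, is that this relies on a stable row order, which a distributed engine like Spark does not guarantee across storage and retrieval.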

How is this achievable? As far as I understand, even if I sort each table by id, Spark distributes them for storage, and retrieval (if I query them back in order to append) is not guaranteed to preserve that order.

Recommended Answer

There doesn't seem to be a Spark function to append a column from one DataFrame to another directly, other than 'join'.

If you are starting from a single dataframe and trying to generate new features from each of its original columns, I would suggest using 'pandas_udf', where the new features for all the original columns can be appended inside the 'udf'.

This avoids using 'join' entirely. To control memory usage, choose the 'group' column so that each group fits within the executor memory specification.
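A minimal sketch of this idea: the per-feature logic lives in one pandas function that appends all derived columns, so no join is needed afterwards. The statistics below are illustrative placeholders, not the asker's actual computations. In current PySpark (3.x) the grouped-map pattern the answer refers to is spelled `groupBy(...).applyInPandas` (it was previously a `pandas_udf` with `PandasUDFType.GROUPED_MAP`):

```python
import pandas as pd

def add_salary_features(pdf: pd.DataFrame) -> pd.DataFrame:
    # Append all derived columns inside one pandas function, so the
    # result needs no join. (Illustrative statistics only.)
    pdf = pdf.copy()
    pdf["bin_salary"] = pd.cut(pdf["salary"], bins=3, labels=False)
    pdf["stat1_salary"] = pdf["salary"] - pdf["salary"].mean()
    pdf["stat2_salary"] = pdf["salary"].rank()
    return pdf

# In Spark, the same function runs once per group, each group small
# enough to fit in executor memory -- e.g., assuming a Spark DataFrame
# `df` with a 'group' column:
#
#   out = df.groupBy("group").applyInPandas(
#       add_salary_features,
#       schema="id long, salary double, group string, "
#              "bin_salary long, stat1_salary double, stat2_salary double")
```

Choosing the 'group' column is the memory-control knob: each group is converted to a pandas DataFrame on one executor, so groups must be bounded in size.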

