Add column from one dataframe to another WITHOUT JOIN


Problem Description

Referring to here, which recommends a join to append a column from one table to another. I have indeed been using this method, but I am now reaching its limits for a huge number of tables and rows.

Let's say I have a dataframe of M features: id, salary, age, etc.

+-----+--------+-----+------+
| id  | salary | age | zone | ....
+-----+--------+-----+------+

I have performed certain operations on each feature to arrive at something like this:

+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_salary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+

Each feature is processed independently, over the same list of rows:

+-----+--------+--------------+--------------+--------------+
| id  | salary | stat1_salary | stat2_salary | stat3_salary |
+-----+--------+--------------+--------------+--------------+
| 301 | x1     | x            | x            | x            |
| 302 | null   | x            | x            | x            |
| 303 | x3     | x            | x            | x            |
+-----+--------+--------------+--------------+--------------+

+-----+------+-----------+-----------+
| id  | age  | stat1_age | stat2_age |
+-----+------+-----------+-----------+
| 301 | null | x         | x         |
| 302 | x2   | x         | x         |
| 303 | x3   | x         | x         |
+-----+------+-----------+-----------+

In the end, I would like to combine them into a final dataframe with all attributes of each feature, by joining on the unique id across effectively hundreds to thousands of tables, one per feature. This final dataframe is my feature vector.

| id | salary | stat1_salary | stat2_salary | stat3_salary | age | stat1_age | stat2_age | ...
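For reference, the join-based combination described above corresponds to something like the following. This is a local pandas sketch with hypothetical data; in PySpark each `merge` would be a `DataFrame.join`, repeated once per feature table:

```python
from functools import reduce
import pandas as pd

# Hypothetical per-feature tables, one per feature, all keyed by id
tables = [
    pd.DataFrame({"id": [301, 302, 303], "salary": [10, 20, 30]}),
    pd.DataFrame({"id": [301, 302, 303], "age": [40, 50, 60]}),
    pd.DataFrame({"id": [301, 302, 303], "zone": [1, 2, 3]}),
]

# The join-based combination: one merge (join on id) per feature table
feature_vector = reduce(lambda a, b: a.merge(b, on="id"), tables)
# feature_vector columns: id, salary, age, zone
```

With hundreds to thousands of tables, this chain of joins is exactly the part that becomes expensive in Spark.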

I am hitting a memory limit that causes an Out Of Memory exception. Raising executor and driver memory seems to be only a temporary solution, and is limited by the admin.

JOIN is expensive and limited by resources in pyspark, so I wonder whether it is possible to pre-sort each feature table independently, then keep that order and simply APPEND the entire columns next to one another instead of performing an expensive JOIN. I can manage to keep the same list of rows for every feature table. I hope to avoid any join or lookup, because my set of ids is identical across tables.
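As a local pandas sketch (not Spark, and with hypothetical data), the sort-then-append the question has in mind would look like this:

```python
import pandas as pd

# Two per-feature tables sharing the same set of ids (hypothetical data)
salary = pd.DataFrame({"id": [302, 301, 303], "salary": [20, 10, 30]})
age    = pd.DataFrame({"id": [301, 303, 302], "age": [40, 60, 50]})

# Sort each table by id, then place columns side by side -- no join, no lookup
salary = salary.sort_values("id").reset_index(drop=True)
age    = age.sort_values("id").reset_index(drop=True)
combined = pd.concat([salary, age.drop(columns="id")], axis=1)
# combined has columns id, salary, age, aligned purely by row order
```

The catch, as the next paragraph notes, is that this relies on a stable row order, which a distributed engine like Spark does not guarantee across storage and retrieval.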

How is this achievable? As far as I understand, even if I sort each table by id, Spark distributes them for storage, and retrieval (if I query them back in order to append) is not guaranteed to preserve that order.

Recommended Answer

There doesn't seem to be a Spark function to append a column from one DataFrame to another directly, other than 'join'.

If you are starting from a single dataframe and trying to generate new features from each of its original columns, I would suggest using 'pandas_udf', where the new features for all the original columns can be appended inside the 'udf'.

This avoids using 'join' entirely. To control memory usage, choose the 'group' column so that each group fits within the executor memory specification.
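A minimal sketch of this idea: the per-feature logic lives in one pandas function that appends all derived columns, so no join is needed afterwards. The statistics below are illustrative placeholders, not the asker's actual computations. In current PySpark (3.x) the grouped-map pattern the answer refers to is spelled `groupBy(...).applyInPandas` (it was previously a `pandas_udf` with `PandasUDFType.GROUPED_MAP`):

```python
import pandas as pd

def add_salary_features(pdf: pd.DataFrame) -> pd.DataFrame:
    # Append all derived columns inside one pandas function, so the
    # result needs no join. (Illustrative statistics only.)
    pdf = pdf.copy()
    pdf["bin_salary"] = pd.cut(pdf["salary"], bins=3, labels=False)
    pdf["stat1_salary"] = pdf["salary"] - pdf["salary"].mean()
    pdf["stat2_salary"] = pdf["salary"].rank()
    return pdf

# In Spark, the same function runs once per group, each group small
# enough to fit in executor memory -- e.g., assuming a Spark DataFrame
# `df` with a 'group' column:
#
#   out = df.groupBy("group").applyInPandas(
#       add_salary_features,
#       schema="id long, salary double, group string, "
#              "bin_salary long, stat1_salary double, stat2_salary double")
```

Choosing the 'group' column is the memory-control knob: each group is converted to a pandas DataFrame on one executor, so groups must be bounded in size.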

