Memory efficient cartesian join in PySpark


Problem Description

I have a large dataset of string IDs that can fit into memory on a single node in my Spark cluster. The issue is that it consumes most of the memory of a single node.

These IDs are about 30 characters long. For example:

ids
O2LWk4MAbcrOCWo3IVM0GInelSXfcG
HbDckDXCye20kwu0gfeGpLGWnJ2yif
o43xSMBUJLOKDxkYEQbAEWk4aPQHkm

I am looking to write to a file the list of all pairs of IDs. For example:

id1,id2
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,HbDckDXCye20kwu0gfeGpLGWnJ2yif
O2LWk4MAbcrOCWo3IVM0GInelSXfcG,o43xSMBUJLOKDxkYEQbAEWk4aPQHkm
HbDckDXCye20kwu0gfeGpLGWnJ2yif,O2LWk4MAbcrOCWo3IVM0GInelSXfcG
# etc...

So I need to cross join the dataset with itself. I was hoping to do this in PySpark on a 10-node cluster, but it needs to be memory efficient.

Recommended Answer

PySpark will handle your dataset easily and memory-efficiently, but it will take time to process the 10^8 * 10^8 records (the estimated size of the cross join result). See the sample code:

from pyspark.sql.types import StructType, StructField, StringType

# Read the single-column id file with an explicit string schema
df = spark.read.csv('input.csv', header=True, schema=StructType([StructField('id', StringType())]))

# Cross join the dataset with itself, renaming the id column on each side, and preview the result
df.withColumnRenamed('id', 'id1').crossJoin(df.withColumnRenamed('id', 'id2')).show()
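Since the question asks for the pairs to be written to a file, here is a minimal sketch (the output directory pairs_out/ is a hypothetical name) that writes the cross join result directly from the executors, so the full pair list never has to fit on a single node:

# Sketch, assuming 'pairs_out/' as a hypothetical output directory.
# Writing from the executors keeps the full pair list off the driver.
pairs = df.withColumnRenamed('id', 'id1').crossJoin(df.withColumnRenamed('id', 'id2'))
pairs.write.csv('pairs_out/', header=True)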
