How to use data from Mongo and PostgreSQL as in-memory lookup tables?
Question
This is a continuation of this question: Porting a multi-threaded compute intensive job to Spark (http://stackoverflow.com/questions/32276856/porting-a-multi-threaded-compute-intensive-job-to-spark).
I am using foreachPartition, as suggested there, to loop over a list of 10,000 IDs, and I then do a repartition(20), because every partition creates a DB connection; if I create, say, 100 partitions, the job just dies under 100 open connections to Postgres and Mongo. I use the Postgres connection not only to store data but also to look up some data from another table.
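For reference, a minimal sketch of the pattern just described, assuming `ids` is an RDD of IDs and using placeholder values for the JDBC URL, credentials, table and per-ID work:

```scala
import java.sql.DriverManager

// One JDBC connection per partition, reused for every ID in that partition.
// URL, credentials, table and the per-ID work are placeholders.
ids.repartition(20).foreachPartition { partition =>
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://host:5432/mydb", "user", "password")
  try {
    val stmt = conn.prepareStatement(
      "SELECT payload FROM lookup_table WHERE id = ?")
    partition.foreach { id =>
      stmt.setLong(1, id)
      val rs = stmt.executeQuery()
      // ... compute-intensive work for this ID, using the looked-up row ...
      rs.close()
    }
    stmt.close()
  } finally {
    conn.close() // with 20 partitions, at most 20 connections are open
  }
}
```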
I could stop storing the data to Postgres directly from my task and instead do it as post-processing from a sequence file.
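A hedged sketch of that variant: have the tasks emit (id, result) pairs to a sequence file and bulk-load them into Postgres as a separate step afterwards; `compute` and the output path are hypothetical placeholders:

```scala
// Write per-ID results to a sequence file instead of opening a Postgres
// connection inside the job; load the file into Postgres later.
// compute(...) stands in for the real work and is assumed to return a String.
val results = ids.map(id => (id, compute(id)))
results.saveAsSequenceFile("hdfs:///jobs/run-001/results")
```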
But ideally I need to massively parallelize my Spark job so that the task completes within a given time: currently it processes about 200 IDs in 20 hours, whereas I need to process 10,000 IDs in 20 hours. So repartition(20) is clearly not helping; I am bound by DB I/O here.
So what are my options for efficiently sharing this data across all tasks? I want the data in Mongo and Postgres to be treated as in-memory lookup tables; the total size is about 500 GB.
My options are:
- RDD (I don't think an RDD fits my use case)
- Dataframe (see the sketch after this list)
- Broadcast variables (not sure this will work, since creating one needs 500 GB available in the Spark driver)
- Move the data from Mongo to S3 and have tasks look it up from S3.
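One way the Dataframe option could look: read the Postgres lookup table once through Spark's JDBC data source and replace per-row queries with a join. A sketch only, assuming a SparkContext `sc`; the URL, credentials, table name and join key are assumptions:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load the Postgres lookup table once as a DataFrame; no task ever holds
// its own database connection during the compute phase.
val lookupDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/mydb")
  .option("dbtable", "lookup_table")
  .option("user", "user")
  .option("password", "password")
  .load()

// Turn the 10,000 IDs into a DataFrame and join instead of querying per ID.
val idsDF = sqlContext.createDataFrame(ids.map(Tuple1.apply)).toDF("id")
val joined = idsDF.join(lookupDF, "id")
```

Spark then shuffles the two sides to match rows, so the parallelism is no longer capped by the number of database connections you can afford.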
The technique we follow for this kind of problem is (a code sketch follows this list):
- Store the lookups in a separate MongoDB collection.
- Use the Hadoop MongoDB connector to get the data from MongoDB and store it in an RDD.
- Broadcast the variable so that it is available to all the nodes/workers.
- Now, if the input data is in HDFS, create an RDD for it; if it is in MongoDB, use the Hadoop MongoDB connector again.
- Now perform the lookup/matching part.
- Save the result as a sequence file, or you could also save it to S3 (you would need to check on that), since we store it back to MongoDB.
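A hedged sketch of those steps with the mongo-hadoop connector; the Mongo URI, the field names, `dataRdd` and `extractKey` are assumptions for illustration, and `sc` is the SparkContext:

```scala
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

// Step 2: read the lookup collection from MongoDB into an RDD.
val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://host:27017/mydb.lookups")
val lookupRdd = sc.newAPIHadoopRDD(
  mongoConf,
  classOf[MongoInputFormat],
  classOf[Object],       // document _id
  classOf[BSONObject])   // document body

// Step 3: collect and broadcast the lookup data to all nodes/workers.
// (Only feasible if this collection fits in driver and executor memory.)
val lookupMap = lookupRdd
  .map { case (_, doc) => (doc.get("key").toString, doc.get("value")) }
  .collectAsMap()
val lookupBc = sc.broadcast(lookupMap)

// Steps 4-5: match the input data against the broadcast map.
val matched = dataRdd.map { record =>
  (record, lookupBc.value.get(extractKey(record))) // extractKey is hypothetical
}

// Step 6: persist; saveAsObjectFile writes a sequence file of serialized objects.
matched.saveAsObjectFile("hdfs:///jobs/run-001/matched")
```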