How to use data from Mongo and PostgreSQL as in-memory lookup tables?
Question
This is a continuation of this question: Porting a multi-threaded compute intensive job to Spark (http://stackoverflow.com/questions/32276856/porting-a-multi-threaded-compute-intensive-job-to-spark).
I am using foreachPartition, as suggested there, to loop over a list of 10,000 IDs, and I then do a repartition(20), because every partition creates a DB connection; if I create, say, 100 partitions, the job just dies under 100 open connections to Postgres and Mongo. I use the Postgres connection not only to store data but also to look up some data from another table.
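For reference, a minimal sketch of the pattern just described, assuming `ids` is an RDD of IDs and using placeholder values for the JDBC URL, credentials, table and per-ID work:

```scala
import java.sql.DriverManager

// One JDBC connection per partition, reused for every ID in that partition.
// URL, credentials, table and the per-ID work are placeholders.
ids.repartition(20).foreachPartition { partition =>
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://host:5432/mydb", "user", "password")
  try {
    val stmt = conn.prepareStatement(
      "SELECT payload FROM lookup_table WHERE id = ?")
    partition.foreach { id =>
      stmt.setLong(1, id)
      val rs = stmt.executeQuery()
      // ... compute-intensive work for this ID, using the looked-up row ...
      rs.close()
    }
    stmt.close()
  } finally {
    conn.close() // with 20 partitions, at most 20 connections are open
  }
}
```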
I could stop storing the data to Postgres directly from my task and instead do it as post-processing from a sequence file.
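A hedged sketch of that variant: have the tasks emit (id, result) pairs to a sequence file and bulk-load them into Postgres as a separate step afterwards; `compute` and the output path are hypothetical placeholders:

```scala
// Write per-ID results to a sequence file instead of opening a Postgres
// connection inside the job; load the file into Postgres later.
// compute(...) stands in for the real work and is assumed to return a String.
val results = ids.map(id => (id, compute(id)))
results.saveAsSequenceFile("hdfs:///jobs/run-001/results")
```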
But ideally I need to massively parallelize my Spark job so that the task completes within a given time: currently it processes about 200 IDs in 20 hours, whereas I need to process 10,000 IDs in 20 hours. So repartition(20) is clearly not helping; I am bound by DB I/O here.
So what are my options for efficiently sharing this data across all tasks? I want the data in Mongo and Postgres to be treated as in-memory lookup tables; the total size is about 500 GB.
My options are:
- RDD (I don't think an RDD fits my use case)
- Dataframe (see the sketch after this list)
- Broadcast variables (not sure this will work, since creating one needs 500 GB available in the Spark driver)
- Move the data from Mongo to S3 and have tasks look it up from S3.
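One way the Dataframe option could look: read the Postgres lookup table once through Spark's JDBC data source and replace per-row queries with a join. A sketch only, assuming a SparkContext `sc`; the URL, credentials, table name and join key are assumptions:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load the Postgres lookup table once as a DataFrame; no task ever holds
// its own database connection during the compute phase.
val lookupDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/mydb")
  .option("dbtable", "lookup_table")
  .option("user", "user")
  .option("password", "password")
  .load()

// Turn the 10,000 IDs into a DataFrame and join instead of querying per ID.
val idsDF = sqlContext.createDataFrame(ids.map(Tuple1.apply)).toDF("id")
val joined = idsDF.join(lookupDF, "id")
```

Spark then shuffles the two sides to match rows, so the parallelism is no longer capped by the number of database connections you can afford.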
The technique we follow for this kind of problem is (a code sketch follows this list):
- Store the lookups in a separate MongoDB collection.
- Use the Hadoop MongoDB connector to get the data from MongoDB and store it in an RDD.
- Broadcast the variable so that it is available to all the nodes/workers.
- Now, if the input data is in HDFS, create an RDD for it; if it is in MongoDB, use the Hadoop MongoDB connector again.
- Now perform the lookup/matching part.
- Save the result as a sequence file, or you could also save it to S3 (you would need to check on that), since we store it back to MongoDB.
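A hedged sketch of those steps with the mongo-hadoop connector; the Mongo URI, the field names, `dataRdd` and `extractKey` are assumptions for illustration, and `sc` is the SparkContext:

```scala
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

// Step 2: read the lookup collection from MongoDB into an RDD.
val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://host:27017/mydb.lookups")
val lookupRdd = sc.newAPIHadoopRDD(
  mongoConf,
  classOf[MongoInputFormat],
  classOf[Object],       // document _id
  classOf[BSONObject])   // document body

// Step 3: collect and broadcast the lookup data to all nodes/workers.
// (Only feasible if this collection fits in driver and executor memory.)
val lookupMap = lookupRdd
  .map { case (_, doc) => (doc.get("key").toString, doc.get("value")) }
  .collectAsMap()
val lookupBc = sc.broadcast(lookupMap)

// Steps 4-5: match the input data against the broadcast map.
val matched = dataRdd.map { record =>
  (record, lookupBc.value.get(extractKey(record))) // extractKey is hypothetical
}

// Step 6: persist; saveAsObjectFile writes a sequence file of serialized objects.
matched.saveAsObjectFile("hdfs:///jobs/run-001/matched")
```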