A better way to load MongoDB data to a DataFrame using Pandas and PyMongo?

Question

I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a dataframe. However, I get an error.

MemoryError:    

My code is the following:

from pandas import DataFrame  # pymongo collection `tweets` is assumed to exist already

cursor = tweets.find() #Where tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns = tweet_fields)

I've tried the methods in the following answers, which at some point create a list of all the elements of the database before loading it.

  • https://stackoverflow.com/a/17805626/2297475
  • https://stackoverflow.com/a/16255680/2297475

However, in another answer which talks about list(), the person said that it's good for small data sets, because everything is loaded into memory.

In my case, I think it's the source of the error. It's too much data to be loaded into memory. What other method can I use?

Answer

I've modified my code to the following:

cursor = tweets.find(fields=['id'])
tweet_fields = ['id']
result = DataFrame(list(cursor), columns = tweet_fields)

By adding the fields parameter to the find() function, I restricted the output: instead of loading every field, only the selected fields are loaded into the DataFrame. Everything works fine now.
