Smartest way to store huge amounts of data


Problem description

I want to access the Flickr API with a REST request and download the metadata of approximately 1 million photos (maybe more). I want to store them in a .csv file and then import them into a MySQL database for further processing.

I am wondering what the smartest way to handle such a large amount of data is. What I am not sure about is how to store the records after retrieving them from the website in Python, how to pass them to the .csv file, and from there into the database. That is one big question mark.

What happens now (as I understand it, see the code below) is that a dictionary is created for every photo (250 per requested URL). That way I would end up with as many dictionaries as photos (1 million or more). Is that possible? All these dictionaries are appended to a list. Can I append that many dictionaries to a list? The only reason I want to append the dictionaries to a list is that it seems much easier to save a list, row by row, to a .csv file.
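(For reference, a minimal sketch of writing such dictionaries straight to a .csv file with csv.DictWriter, one row per photo, so the whole list never has to be kept in memory. The file name photos.csv and the example rows are placeholders; the field names match the keys built in the code below.)

import csv

# Field names match the keys built for each photo in the code below.
fieldnames = ["id", "title", "tags", "latitude", "longitude"]

# Placeholder rows; in practice these would come from the Flickr response.
photo_rows = [
    {"id": "123", "title": "example", "tags": "beach", "latitude": "52.5", "longitude": "13.4"},
]

with open("photos.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for row in photo_rows:
        writer.writerow(row)  # one row per photo, no list of all rows needed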

What you should know is that I am a complete beginner to programming, Python or anything like it. My profession is a completely different one, and I have only just started to learn. If you need any further explanation, please let me know!

# accessing the website
from urllib.request import urlopen
from bs4 import BeautifulSoup

photos = []  # avoid shadowing the built-in name "list"
url = "https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=5...1b&per_page=250&accuracy=1&has_geo=1&extras=geo,tags,views,description"
soup = BeautifulSoup(urlopen(url), "xml")  # soup it up; the response is XML (the "xml" parser requires lxml)
for data in soup.find_all('photo'):
    photo = {  # avoid shadowing the built-in name "dict"
        "id": data.get('id'),
        "title": data.get('title'),
        "tags": data.get('tags'),
        "latitude": data.get('latitude'),
        "longitude": data.get('longitude'),
    }
    print(photo)
    photos.append(photo)  # append inside the loop, once per photo

I am working with Python 3.3. The reason why I do not pass the data directly into the database is that I cannot get the Python connector for MySQL to run on my OS X 10.6.

Any help is very much appreciated. Thank you, folks!

Recommended answer

I recommend using SQLite for prototyping this rather than messing with CSV. SQLite works very well with Python, and you don't have to go through all the headache of setting up a separate database.
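(A minimal sketch using Python's built-in sqlite3 module; the file name photos.db and the column list are assumptions based on the fields collected in the question.)

import sqlite3

# Hypothetical database file and schema, matching the fields collected in the question.
conn = sqlite3.connect("photos.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS photos (
        id TEXT PRIMARY KEY,
        title TEXT,
        tags TEXT,
        latitude REAL,
        longitude REAL
    )
""")
conn.commit()
conn.close()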

Also, I don't think you want to use BeautifulSoup for this, since it doesn't sound like scraping is what you really want. It sounds like you want to access the REST API directly. For that you'll want to use something like the requests library, or better yet one of the Flickr Python bindings.
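(A minimal sketch of calling the API with requests and asking Flickr for JSON instead of XML. The API key is a placeholder, and the pagination values are just examples; exact response handling may differ.)

import requests

# Placeholder API key; per_page/page are Flickr's standard pagination parameters.
params = {
    "method": "flickr.photos.search",
    "api_key": "YOUR_API_KEY",
    "per_page": 250,
    "page": 1,
    "has_geo": 1,
    "extras": "geo,tags,views,description",
    "format": "json",
    "nojsoncallback": 1,
}
resp = requests.get("https://api.flickr.com/services/rest/", params=params)
data = resp.json()
for photo in data["photos"]["photo"]:
    print(photo["id"], photo.get("title"))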

Once you have that up and running, I would write to the DB during each iteration of the loop, saving as you go. That way you're not using tons of memory, and if something crashes you don't lose the data you've pulled so far.
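(A minimal sketch of that pattern, reusing the hypothetical photos.db table and the JSON call from the sketches above; the page range and placeholder API key are assumptions.)

import sqlite3
import requests

# Hypothetical database file; assumes the photos table from the earlier sketch exists.
conn = sqlite3.connect("photos.db")

params = {
    "method": "flickr.photos.search",
    "api_key": "YOUR_API_KEY",  # placeholder
    "per_page": 250,
    "has_geo": 1,
    "extras": "geo,tags",
    "format": "json",
    "nojsoncallback": 1,
}

for page in range(1, 11):  # e.g. the first 10 pages
    params["page"] = page
    data = requests.get("https://api.flickr.com/services/rest/", params=params).json()
    for p in data["photos"]["photo"]:
        conn.execute(
            "INSERT OR REPLACE INTO photos (id, title, tags, latitude, longitude) "
            "VALUES (?, ?, ?, ?, ?)",
            (p["id"], p.get("title"), p.get("tags"), p.get("latitude"), p.get("longitude")),
        )
    conn.commit()  # commit once per page, so a crash loses at most the current page

conn.close()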

