Python: handling a large set of data. Scipy or Rpy? And how?

Problem description

In my python environment, the Rpy and Scipy packages are already installed.

The problem I want to tackle is this:

1) A huge set of financial data is stored in a text file. Loading it into Excel is not possible.

2) I need to sum certain fields and get the totals.

3) I need to show the top 10 rows based on the totals.

Which package (Scipy or Rpy) is best suited for this task?

If so, could you provide some pointers (e.g. documentation or online examples) that can help me implement a solution?

Speed is a concern. Ideally Scipy and Rpy can handle large files even when they are too large to fit into memory.
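For concreteness, here is a minimal plain-Python sketch of the kind of streaming aggregation the question describes: it reads the text file one row at a time, sums a numeric field per key, and reports the ten largest totals. The file name ("trades.txt") and the column names ("account", "amount") are placeholders, and the data is assumed to be comma-separated with a header row.

import csv
import heapq
from collections import defaultdict

def top_totals(path, key="account", field="amount", top_n=10):
    """Sum `field` per `key` and return the `top_n` largest totals."""
    totals = defaultdict(float)
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Rows are streamed one at a time, so memory use grows with the
            # number of distinct keys, not with the size of the file.
            totals[row[key]] += float(row[field])
    # Ten largest totals, without sorting the whole dictionary.
    return heapq.nlargest(top_n, totals.items(), key=lambda kv: kv[1])

for account, total in top_totals("trades.txt"):
    print(account, total)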

Answer

As @gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.

Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.
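To give an idea of what the HDF5 route looks like on the Python side, here is a minimal sketch using the h5py package. It assumes the text data has already been converted into an HDF5 file (named "finance.h5" here) containing a one-dimensional numeric dataset ("amount"); both names are placeholders. The dataset is read in slices, so the whole column never has to fit in memory at once.

import h5py
import numpy as np

# Assumed layout: finance.h5 contains a 1-D numeric dataset called "amount".
with h5py.File("finance.h5", "r") as f:
    amounts = f["amount"]              # handle to data on disk; nothing is loaded yet
    total = 0.0
    chunk = 1_000_000                  # number of rows read per slice
    for start in range(0, amounts.shape[0], chunk):
        block = amounts[start:start + chunk]   # only this slice is held in memory
        total += float(np.sum(block))

print("grand total:", total)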
