Python: Fast and efficient way of writing large text file


Problem description

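I have a speed/efficiency question about Python:

I need to write a large number of very large, R-dataframe-like files, each about 0.5-2 GB in size. Each file is basically a large tab-separated table in which every line can contain floats, integers and strings.

Normally I would put all my data in a NumPy array and save it with np.savetxt, but because the columns have different data types, the data can't really be put into one array.

I have therefore resorted to assembling the lines as strings by hand, but this is rather slow. So far I'm doing the following:

1) Assemble each line as a string
2) Concatenate all lines into a single huge string
3) Write the string to the file

I have several problems with this:
1) The large number of string concatenations ends up taking a lot of time
2) I run out of RAM when keeping the strings in memory
3) ...which in turn leads to more separate file.write calls, which are also very slow.

So my question is: what is a good routine for this kind of problem? One that balances speed against memory consumption for the most efficient string concatenation and writing to disk.

...or is this strategy simply bad, and should I do something completely different?

Thanks in advance!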

Solution

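Pandas seems like a good tool for this problem. It is easy to get started with, and it handles most of the ways you might need to get data into Python. Pandas deals well with mixed data (floats, ints, strings) and can usually detect the types on its own.

Once you have an (R-like) data frame in pandas, writing the frame out to CSV is straightforward: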

  DataFrame.to_csv(path_or_buf, sep='\t')

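to_csv takes a number of other options that you can set to get your tab-separated file just right; see the documentation: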
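As a minimal sketch of the to_csv call above, assuming pandas is installed: the helper name write_tsv and the example column names are made up for illustration and are not part of the original answer. index=False is one of the options mentioned, used here so that the row labels are not written as an extra first column.

```python
import pandas as pd


def write_tsv(df: pd.DataFrame, path: str) -> None:
    """Write df to path as a tab-separated text file (hypothetical helper)."""
    # sep="\t" switches the delimiter from a comma to a tab;
    # index=False leaves out pandas' row labels.
    df.to_csv(path, sep="\t", index=False)


# Hypothetical mixed-type table; the column names are illustrative only.
example = pd.DataFrame({
    "name": ["a", "b", "c"],      # strings
    "count": [1, 2, 3],           # integers
    "score": [0.5, 1.25, 2.0],    # floats
})
```

Calling write_tsv(example, "table.tsv") would then write a header line followed by one tab-separated line per row.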



http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
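If the whole table does not fit in memory as a data frame, another possible approach, not part of the answer above, is to stream rows to disk with the standard-library csv module, so that no huge string is ever built. The sketch below assumes the rows can be produced one at a time; stream_rows_to_tsv and make_rows are hypothetical names, and the 1 MiB buffer size is an arbitrary choice.

```python
import csv
from typing import Iterable, Iterator, Sequence, Tuple


def stream_rows_to_tsv(path: str,
                       header: Sequence[str],
                       rows: Iterable[Sequence[object]]) -> int:
    """Write rows one at a time; return the number of data rows written."""
    written = 0
    # newline="" lets the csv module control line endings itself;
    # the large buffer (an arbitrary 1 MiB) batches many small writes.
    with open(path, "w", newline="", buffering=1024 * 1024) as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        for row in rows:
            # csv.writer converts ints and floats with str().
            writer.writerow(row)
            written += 1
    return written


def make_rows(n: int) -> Iterator[Tuple[str, int, float]]:
    """Hypothetical row source standing in for the real data."""
    for i in range(n):
        yield (f"item{i}", i, i / 2)
```

Because make_rows is a generator, only the current row is held in memory, and the file object's buffer replaces the manual concatenation of lines.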


