Efficient way to split a large text file in Python

Problem description

This is a follow-up to a previous question (http://stackoverflow.com/questions/15223749/improving-performance-of-a-function-in-python#comment21464492_15223749) where, to improve the time performance of a function in Python, I need to find an efficient way to split my text file.

I have the following text file (more than 32 GB), not sorted:

....................
0 274 593869.99 6734999.96 121.83 1,
0 273 593869.51 6734999.92 121.57 1,
0 273 593869.15 6734999.89 121.57 1,
0 273 593868.79 6734999.86 121.65 1,
0 272 593868.44 6734999.84 121.65 1,
0 273 593869.00 6734999.94 124.21 1,
0 273 593868.68 6734999.92 124.32 1,
0 274 593868.39 6734999.90 124.44 1,
0 275 593866.94 6734999.71 121.37 1,
0 273 593868.73 6734999.99 127.28 1,
.............................

The first and second columns are the ID (ex: 0 -273) of the location of the x,y,z point in a grid, computed as:

def point_grid_id(x,y,minx,maxy,distx,disty):
    """give id (row,col)"""
    col = int((x - minx)/distx)
    row = int((maxy - y)/disty)
    return (row, col)
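For example, for the first data row above (the minx, maxy, distx and disty values here are made up, purely for illustration):

# hypothetical grid parameters, for illustration only
minx, maxy = 593770.0, 6735100.0
distx = disty = 0.365

print(point_grid_id(593869.99, 6734999.96, minx, maxy, distx, disty))
# -> (274, 273) with these assumed parameters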

(minx, maxy) is the origin of my grid, with tile size distx, disty. The tile IDs are:

import numpy as np

tiles_id = [j for j in np.ndindex(ny, nx)]  # ny = number of rows, nx = number of columns
# i.e. [(0,0), (0,1), (0,2), ..., (ny-1, nx-1)]
n = len(tiles_id)

I want to slice the ~32 GB file into n (= len(tiles_id)) files.

I can do this without sorting, but it would mean reading the file n times. For this reason I wish to find an efficient method to split the file, starting from (0,0) (= tiles_id[0]). After that I only need to read each split file once.

Recommended answer

Sorting is hardly possible for a 32 GB file, no matter whether you use Python or a command line tool (sort). A database seems like overkill, but could be used. However, if you are unwilling to use a database, I would suggest simply splitting the source file into per-tile files using the tile ID.

You read a line, build a file name from its tile ID, and append the line to that file, continuing until the source file is exhausted. It will not be very fast, but at least it is O(N), unlike sorting.

And, of course, sorting the individual files and concatenating them afterwards is possible. The main bottleneck in sorting a 32 GB file should be memory, not CPU.
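A minimal sketch of that follow-up step, assuming each per-tile file fits in memory; the tile_*.tmp pattern matches the temp_file_name() helper in the code below, and the combined output name is arbitrary:

import glob

# sort each per-tile file in memory, then append it to one combined file
with open('combined_sorted.txt', 'w') as out:
    for fn in sorted(glob.glob('tile_*.tmp')):
        with open(fn) as tile:
            out.writelines(sorted(tile))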

Here it is, I think:

def temp_file_name(l):
    """Build a temp file name from the first two columns (the tile ID)."""
    id0, id1 = l.split()[:2]
    return "tile_%s_%s.tmp" % (id0, id1)

def split_file(name):
    ofiles = {}  # tile file name -> open file handle
    try:
        with open(name) as f:
            for l in f:
                if l.strip():  # skip blank lines
                    fn = temp_file_name(l)
                    if fn not in ofiles:
                        ofiles[fn] = open(fn, 'w')
                    ofiles[fn].write(l)
    finally:
        # close every output file, even if an error occurred
        for of in ofiles.values():
            of.close()

split_file('srcdata1.txt')

But if there are a lot of tiles, more than the number of files you can have open at once, you may do this:

def split_file(name):
    with open(name) as f:
        for l in f:
            if l.strip():
                fn = temp_file_name(l)
                # open in append mode and close immediately: slower,
                # but never holds more than one output file open
                with open(fn, 'a') as of:
                    of.write(l)

And the most perfectionist approach is to close some files and remove them from the dictionary after reaching a limit on the number of open files.
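A minimal sketch of that idea, using collections.OrderedDict as a small LRU cache of open file handles (the split_file_capped name and the max_open value are hypothetical, and move_to_end requires Python 3):

from collections import OrderedDict

def split_file_capped(name, max_open=200):
    ofiles = OrderedDict()  # least-recently-used handle sits first
    try:
        with open(name) as f:
            for l in f:
                if l.strip():
                    fn = temp_file_name(l)
                    if fn in ofiles:
                        ofiles.move_to_end(fn)  # mark as recently used
                    else:
                        if len(ofiles) >= max_open:
                            # evict and close the least-recently-used handle
                            _, lru = ofiles.popitem(last=False)
                            lru.close()
                        # append mode: an evicted file may be reopened later
                        ofiles[fn] = open(fn, 'a')
                    ofiles[fn].write(l)
    finally:
        for of in ofiles.values():
            of.close()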
