gdal ReadAsArray for vrt extremely slow


Problem description

I am trying to simply subtract two rasters and save the result in another raster. One of the input images is a tif file, the other is a vrt file. (The output is a tif.)

The files are very big, so I open them, divide them into tiles, loop over each tile, and subtract. The problem is that it is extremely slow!

import gdal
import numpy as np

rA = gdal.Open(raFileName)
rB = gdal.Open(rbFileName)

nodataA = rA.GetRasterBand(1).GetNoDataValue()
nodataB = rB.GetRasterBand(1).GetNoDataValue()

# output has the same dimensions as the inputs;
# use input A's nodata value for the output
ncols = rA.RasterXSize
nrows = rA.RasterYSize
nodata = nodataA

raster = gdal.GetDriverByName('GTiff').Create(outputFileName, ncols, nrows, 1, gdal.GDT_Float32, ['COMPRESS=LZW', 'BIGTIFF=YES'])
rasterBand = raster.GetRasterBand(1)
rasterBand.SetNoDataValue(nodata)

# tile size
trows = 5000
tcols = 5000
# number of tiles in output file (ceil)
ntrows = (nrows - 1) // trows + 1
ntcols = (ncols - 1) // tcols + 1

# tiling because of memory problems
for r in range(ntrows):
    for c in range(ntcols):
        # number of rows/cols for this tile (smaller at the edges)
        rtrows = min(trows, nrows - r * trows)
        ctcols = min(tcols, ncols - c * tcols)

        # read the tile from both input files
        rA_arr = rA.GetRasterBand(1).ReadAsArray(c * tcols, r * trows, ctcols, rtrows)
        rB_arr = rB.GetRasterBand(1).ReadAsArray(c * tcols, r * trows, ctcols, rtrows)

        # mask pixels that are nodata in either input
        mask = np.logical_or(rA_arr == nodataA, rB_arr == nodataB)

        subtraction = rA_arr - rB_arr
        subtraction[mask] = nodata

        # write this tile to the output raster
        rasterBand.WriteArray(subtraction, c * tcols, r * trows)
        raster.FlushCache()

The rasters I am currently trying to subtract have 16*16 tiles (of 5000*5000 pixels each), and after about two hours it has only gone through 3 rows!

Is there any way to improve the performance?

Answer

Late to the party, but here is a script I wrote based on Rutger's excellent answer. It optimizes the tile size so that you read as few blocks as possible. It's almost certainly not the best you can do, but I noticed it vastly improved my runtimes when processing geo-rasters of size [1440000 x 560000]. There are two functions here: optimal_tile_size and split_raster_into_tiles. The latter provides the coordinates of the (optimized w.r.t. blocksize) tiles for a given input raster object.

import numpy as np
import gdal

def split_raster_into_tiles(rast_obj, N, overlap=0, aspect=0):

    """
    Given an input raster object (created via gdal) and the number of tiles
    you wish to break it into, returns an Nx4 array of coordinates corresponding
    to the corners of boxes that tile the input. Tiles are created to minimize
    the number of blocks read from the raster file. An optional
    overlap value will adjust the tiles such that they each overlap by the 
    specified pixel amount.

    INPUTS:
        rast_obj - obj: gdal raster object created by gdal.Open()
               N - int: number of tiles to split raster into
         overlap - int: (optional) number of pixels each tile should overlap
          aspect - float or str: (optional) if set to 0, aspect is not
                   considered. If set to 'use_raster', the aspect of the
                   raster is respected. Otherwise an aspect = dx/dy may
                   be provided.

    OUTPUTS:
           tiles - np array: Nx4 array where N is number of tiles and the four
                             columns correspond to: [xmin, xmax, ymin, ymax]
    """

    # Get optimal tile size
    dx, dy = optimal_tile_size(rast_obj, N, aspect=aspect)

    ncols = rast_obj.RasterXSize
    nrows = rast_obj.RasterYSize

    # Generate nonoverlapping tile edge coordinates
    ulx = np.array([0])
    uly = np.array([0])
    while ulx[-1] < ncols:
        ulx = np.append(ulx, ulx[-1]+dx)
    while uly[-1] < nrows:
        uly = np.append(uly, uly[-1]+dy)

    # In case of overshoots, remove the last points and replace with raster extent
    ulx = ulx[:-1] 
    uly = uly[:-1]
    ulx = np.append(ulx, ncols)
    uly = np.append(uly, nrows)

    # Ensure final tile is large enough; if not, delete second-from-last point
    if ulx[-1] - ulx[-2] < 2*overlap:
        ulx = np.delete(ulx, -2)
    if uly[-1] - uly[-2] < 2*overlap:
        uly = np.delete(uly, -2)

    # Create tiles array where each row corresponds to [xmin, xmax, ymin, ymax]
    tiles = np.empty(((len(ulx)-1)*(len(uly)-1),4), dtype=int)
    rowct = 0
    for i in range(len(ulx) - 1):
        for j in range(len(uly) - 1):
            tiles[rowct, 0] = ulx[i]
            tiles[rowct, 1] = ulx[i+1]
            tiles[rowct, 2] = uly[j]
            tiles[rowct, 3] = uly[j+1]
            rowct = rowct + 1

    # Adjust tiles for overlap
    if overlap > 0:
        tiles[tiles[:,0] > overlap, 0] = tiles[tiles[:,0] > overlap, 0] - overlap
        tiles[tiles[:,1] < (ncols - overlap), 1] =  tiles[tiles[:,1] < (ncols - overlap), 1] + overlap
        tiles[tiles[:,2] > overlap, 2] = tiles[tiles[:,2] > overlap, 2] - overlap
        tiles[tiles[:,3] < (nrows - overlap), 3] = tiles[tiles[:,3] < (nrows - overlap), 3] + overlap

    print('Tile size X, Y is {}, {}.'.format(dx, dy))

    return tiles


def optimal_tile_size(rast_obj, N, aspect=0):

    """
    Returns a tile size that optimizes reading a raster by considering the 
    blocksize of the raster. The raster is divided into (roughly) N tiles. If
    the shape of the tiles is unimportant (aspect=0), optimization 
    considers only the blocksize. If an aspect ratio is provided, optimization
    tries to respect it as much as possible.

    INPUTS:
        rast_obj - obj: gdal raster object created by gdal.Open()
               N - int: number of tiles to split raster into
         aspect  - float or str: (optional) - If no value is provided, the 
                   aspect ratio is set only by the blocksize. If aspect is set
                   to 'use_raster', aspect is obtained from the aspect of the
                   given raster. Optionally, an aspect may be provided where 
                   aspect = dx/dy.

    OUTPUTS:
              dx -  np.int: optimized number of columns of each tile
              dy -  np.int: optimized number of rows of each tile

    """

#    # If a vrt, try to get the underlying raster blocksize
#    filepath = rast_obj.GetDescription()
#    extension = filepath.split('.')[-1]
#    if extension == 'vrt':
#        sample_tif = rast_obj.GetFileList()[-1]
#        st = gdal.Open(sample_tif)
#        blocksize = st.GetRasterBand(1).GetBlockSize()
#    else:
#        blocksize = rast_obj.GetRasterBand(1).GetBlockSize()

    blocksize = rast_obj.GetRasterBand(1).GetBlockSize()

    ncols = rast_obj.RasterXSize
    nrows = rast_obj.RasterYSize

    # Compute ratios for sizing           
    totalpix = ncols * nrows
    pix_per_block = blocksize[0] * blocksize[1]
    pix_per_tile = totalpix / N

    if aspect == 0: # optimize tile size for fastest I/O

        n_blocks_per_tile = np.round(pix_per_tile / pix_per_block)

        if n_blocks_per_tile >= 1:
            # This assumes the larger dimension of the block size should be retained for sizing tiles
            if blocksize[0] >= blocksize[1]:
                dx = blocksize[0]
                dy = np.round(pix_per_tile / dx)
                ndy = dy / nrows
                if ndy > 1.5:
                    dx = dx * np.round(ndy)
                dy = np.round((pix_per_tile / dx) / blocksize[1]) * blocksize[1]
                dy = np.min((dy, nrows))
                if dy == 0:
                    dy = blocksize[1]
            else: 
                dy = blocksize[1]
                dx = np.round(pix_per_tile / dy)
                ndx = dx / ncols
                if ndx > 1.5:
                    dy = dy * np.round(ndx)
                dx = np.round((pix_per_tile / dy) / blocksize[0]) * blocksize[0]
                dx = np.min((dx, ncols))
                if dx == 0:
                    dx = blocksize[0]

        else:
            print('Tile size is smaller than block size; setting tile size to block size.')
            dx = blocksize[0]
            dy = blocksize[1]

    else: # optimize but respect the aspect ratio as much as possible

        if aspect == 'use_raster':
            aspect = ncols / nrows

        dya = np.round(np.sqrt(pix_per_tile / aspect))
        dxa = np.round(aspect * dya)

        dx = np.round(dxa / blocksize[0]) * blocksize[0]
        dx = np.min((dx, ncols))

        dy = np.round(dya / blocksize[1]) * blocksize[1]
        dy = np.min((dy, nrows))

        # Set dx,dy to blocksize if they're zero
        if dx == 0:
            dx = blocksize[0]
        if dy == 0:
            dy = blocksize[1]

    return dx, dy
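
For reference, here is a minimal usage sketch of how the tiles returned by split_raster_into_tiles could drive the subtraction loop from the question. It assumes rA, rB, rasterBand, nodataA, nodataB and nodata are set up as in the question's script, and N=256 mirrors the question's 16*16 tiling:

# Usage sketch, assuming rA, rB, rasterBand, nodataA, nodataB and nodata
# are defined as in the question's code. Each row of tiles is
# [xmin, xmax, ymin, ymax] in pixel coordinates.
tiles = split_raster_into_tiles(rA, N=256)
for xmin, xmax, ymin, ymax in tiles:
    # cast to plain int for the GDAL bindings
    rA_arr = rA.GetRasterBand(1).ReadAsArray(int(xmin), int(ymin), int(xmax - xmin), int(ymax - ymin))
    rB_arr = rB.GetRasterBand(1).ReadAsArray(int(xmin), int(ymin), int(xmax - xmin), int(ymax - ymin))
    mask = np.logical_or(rA_arr == nodataA, rB_arr == nodataB)
    subtraction = rA_arr - rB_arr
    subtraction[mask] = nodata
    rasterBand.WriteArray(subtraction, int(xmin), int(ymin))

Note that this write pattern relies on the default overlap=0; with overlapping tiles, the writes would need to be clipped back to the non-overlapping cores.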

As a quick benchmark, I tried tiling a 42000 x 36000 virtual raster (with some extra calculations) into 30 tiles. With optimization turned on, runtime was 120 seconds. Without it, runtime was 596 seconds. If you're going to tile large files, taking blocksize into account will be worth your while.
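
The reason blocksize matters: GDAL stores rasters in internal blocks (strips or tiles for GTiff), and any ReadAsArray window that straddles block boundaries forces every touched block to be read and decompressed. A quick way to see what block layout you are dealing with (with hypothetical file names) is:

import gdal

# GetBlockSize() returns [xsize, ysize] of the band's internal blocks.
# Note that a VRT reports its own block size, which can differ from that
# of the rasters it references.
for fn in ['a.tif', 'b.vrt']:  # hypothetical file names
    ds = gdal.Open(fn)
    print(fn, ds.GetRasterBand(1).GetBlockSize())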
