Is there a way to release the file lock for a xarray.Dataset?


Question


I have a process that grows a NetCDF file fn every 5 minutes using netCDF4.Dataset(fn, mode='a'). I also have a bokeh server visualization of that NetCDF file using an xarray.Dataset (which I want to keep, because it is so convenient).

The problem is that the NetCDF update process fails when trying to add new data to fn if the file is open in my bokeh server process via

ds = xarray.open_dataset(fn)

If I use the option autoclose

ds = xarray.open_dataset(fn, autoclose=True)

updating fn with the other process while ds is "open" in the bokeh server app works, but the updates to the bokeh figure, which pull time slices from fn, get very laggy.
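For context, the update callback in my bokeh app is essentially a periodic lookup along these lines (a simplified sketch, not the real app code; the print stands in for pushing the data into the figure):

import numpy as np
import xarray as xr

ds = xr.open_dataset('my_growing_file.nc', autoclose=True)

def update_figure():
    # With autoclose=True every access re-opens and closes the file,
    # which is what makes these lookups slow.
    latest = ds['rainfall_amount'].isel(time=-1).values
    print('mean of latest rainfall field:', np.nanmean(latest))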

My question is: Is there another way to release the lock of the NetCDF file when using xarray.Dataset?

I would not care if the shape of the xarray.Dataset is only updated consistently after reloading the whole bokeh server app.

Thanks!

Here is a minimal working example:

Put this into a file and let it run:

import time
from datetime import datetime

import numpy as np
import netCDF4

fn = 'my_growing_file.nc'

with netCDF4.Dataset(fn, 'w') as nc_fh:
    # create dimensions
    nc_fh.createDimension('x', 90)
    nc_fh.createDimension('y', 90)
    nc_fh.createDimension('time', None)

    # create variables
    nc_fh.createVariable('x', 'f8', ('x'))
    nc_fh.createVariable('y', 'f8', ('y'))
    nc_fh.createVariable('time', 'f8', ('time'))
    nc_fh.createVariable('rainfall_amount',
                         'i2',
                         ('time', 'y', 'x'),
                         zlib=False,
                         complevel=0,
                         fill_value=-9999,
                         chunksizes=(1, 90, 90))
    nc_fh['rainfall_amount'].scale_factor = 0.1
    nc_fh['rainfall_amount'].add_offset = 0

    nc_fh.set_auto_maskandscale(True)

    # variable attributes
    nc_fh['time'].long_name = 'Time'
    nc_fh['time'].standard_name = 'time'
    nc_fh['time'].units = 'hours since 2000-01-01 00:50:00.0'
    nc_fh['time'].calendar = 'standard'

for i in range(1000):
    with netCDF4.Dataset(fn, 'a') as nc_fh:
        current_length = len(nc_fh['time'])

        print('Appending to NetCDF file {}'.format(fn))
        print(' length of time vector: {}'.format(current_length))

        if current_length > 0:
            last_time_stamp = netCDF4.num2date(
                nc_fh['time'][-1],
                units=nc_fh['time'].units,
                calendar=nc_fh['time'].calendar)
            print(' last time stamp in NetCDF: {}'.format(str(last_time_stamp)))
        else:
            last_time_stamp = '1900-01-01'
            print(' empty file, starting from scratch')

        nc_fh['time'][i] = netCDF4.date2num(
            datetime.utcnow(),
            units=nc_fh['time'].units,
            calendar=nc_fh['time'].calendar)
        nc_fh['rainfall_amount'][i, :, :] = np.random.rand(90, 90)

    print('Sleeping...\n')
    time.sleep(3)

Then, go to e.g. IPython and open the growing file via:

import xarray as xr

ds = xr.open_dataset('my_growing_file.nc')

This will cause the process that appends to the NetCDF to fail with an output like this:

Appending to NetCDF file my_growing_file.nc
 length of time vector: 0
 empty file, starting from scratch
Sleeping...

Appending to NetCDF file my_growing_file.nc
 length of time vector: 1
 last time stamp in NetCDF: 2018-04-12 08:52:39.145999
Sleeping...

Appending to NetCDF file my_growing_file.nc
 length of time vector: 2
 last time stamp in NetCDF: 2018-04-12 08:52:42.159254
Sleeping...

Appending to NetCDF file my_growing_file.nc
 length of time vector: 3
 last time stamp in NetCDF: 2018-04-12 08:52:45.169516
Sleeping...

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-17-9950ca2e53a6> in <module>()
     37 
     38 for i in range(1000):
---> 39     with netCDF4.Dataset(fn, 'a') as nc_fh:
     40         current_length = len(nc_fh['time'])
     41 

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

IOError: [Errno -101] NetCDF: HDF error: 'my_growing_file.nc'

If using

ds = xr.open_dataset('my_growing_file.nc', autoclose=True)

there is no error, but access times via xarray of course get slower, which is exactly my problem since my dashboard visualization gets very laggy.

I can understand that this is maybe not the intended use for xarray and, if required, I will fall back to the lower level interface provided by netCDF4 (hoping that it supports concurrent file access, at least for reads), but I would like to keep xarray for its convenience.
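For reference, the fallback I have in mind would look roughly like this (a sketch only; I have not verified that reads are actually safe while the writer process holds the file):

import netCDF4

def read_latest_slice(fn='my_growing_file.nc'):
    # Open read-only, grab the newest time slice, and close again immediately,
    # so the reader never keeps a lock on the file.
    with netCDF4.Dataset(fn, 'r') as nc_fh:
        return nc_fh['rainfall_amount'][-1, :, :]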

Solution

I am answering my own question here because I found a solution, or better said, a way around this problem with the file lock of NetCDF in Python.

A good solution is to use zarr instead of NetCDF files if you want to continuously grow a dataset in a file while keeping it open, e.g. for a real-time visualization.

Luckily, xarray now also makes it easy to append data to an existing zarr file along a selected dimension using the append_dim keyword argument, thanks to a recently merged PR.

The code for using zarr, instead of NetCDF like in my question, looks like this:


import dask.array as da
import xarray as xr
import pandas as pd
import datetime
import time

fn = '/tmp/my_growing_file.zarr'

# Create a dummy dataset and write it to zarr
data = da.random.random(size=(30, 900, 1200), chunks=(10, 900, 1200))
t = pd.date_range(end=datetime.datetime.utcnow(), periods=30, freq='1s')
ds = xr.Dataset(
    data_vars={'foo': (('time', 'y', 'x'), data)},
    coords={'time': t},
)
#ds.to_zarr(fn, mode='w', encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue':-9999}})
#ds.to_zarr(fn, mode='w', encoding={'time': {'_FillValue': -9999}})
ds.to_zarr(fn, mode='w')

# Append new data in smaller chunks
for i in range(100):
    print('Sleeping for 10 seconds...')
    time.sleep(10)

    data = 0.01 * i + da.random.random(size=(7, 900, 1200), chunks=(7, 900, 1200))
    t = pd.date_range(end=datetime.datetime.utcnow(), periods=7, freq='1s')
    ds = xr.Dataset(
        data_vars={'foo': (('time', 'y', 'x'), data)},
        coords={'time': t},
    )
    print(f'Appending 7 new time slices with latest time stamp {t[-1]}')
    ds.to_zarr(fn, append_dim='time')

You can then open another Python process, e.g. IPython and do

import xarray as xr

ds = xr.open_zarr('/tmp/my_growing_file.zarr/')

over and over again without crashing the writer process.
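A minimal reader loop for testing this could look like the following (my own sketch; a real dashboard would of course do something more useful with the data):

import time
import xarray as xr

for _ in range(10):
    # Re-open the store on every iteration while the writer keeps appending.
    ds = xr.open_zarr('/tmp/my_growing_file.zarr/')
    print('time steps:', ds.dims['time'], 'latest:', ds.time.values[-1])
    time.sleep(5)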

I used xarray version 0.15.0 and zarr version 2.4.0 for this example.

Some additional notes:

Note that the code in this example deliberately appends in small batches of time slices that do not evenly divide the chunk size of the zarr file, to see how this affects the chunking. From my tests I can say that the initially chosen chunk size of the zarr file is preserved, which is great!
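One way to check this (my own addition, not part of the original test) is to inspect the store directly with zarr:

import zarr

root = zarr.open('/tmp/my_growing_file.zarr', mode='r')
# The chunking of 'foo' stays at the initially chosen (10, 900, 1200),
# even though the appends come in batches of 7 time slices.
print(root['foo'].chunks)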

Also note that the code generates a warning when appending because the datetime64 data is encoded and stored as integer by xarray to comply with the CF conventions for NetCDF. This also works for zarr files, but currently it seems that the _FillValue is not automatically set. As long as you do not have NaT in your time data this should not matter.
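If NaT could show up in your time coordinate, a possible workaround, mirroring the commented-out encoding lines in the script above, could be to set the fill value explicitly when the store is created, for example (an untested sketch; the path and fill value are just placeholders):

import datetime

import numpy as np
import pandas as pd
import xarray as xr

t = pd.date_range(end=datetime.datetime.utcnow(), periods=3, freq='1s')
ds = xr.Dataset({'foo': (('time',), np.arange(3))}, coords={'time': t})
# Write with an explicit _FillValue for the encoded time values.
ds.to_zarr('/tmp/fillvalue_test.zarr', mode='w',
           encoding={'time': {'_FillValue': -9999}})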

Disclaimer: I have not yet tried this with a larger dataset and a long-running process which grows the file, so I cannot comment on eventual performance degradation or other problems that might occur if the zarr file or its metadata somehow become fragmented by this process.
