Faster reading of time series from netCDF?


Question


I have some large netCDF files that contain 6 hourly data for the earth at 0.5 degree resolution.

There are 360 latitude points, 720 longitude points, and 1420 time points per year. I have both yearly files (12 GB ea) and one file with 110 years of data (1.3 TB) stored as netCDF-4 (here is an example of the 1901 data, 1901.nc, its use policy, and the original, public files that I started with).

From what I understand, it should be faster to read from one netCDF file than to loop over the set of files originally provided, which were separated by year and variable.

I want to extract a time series for each grid point, e.g. 10 or 30 years from a specific latitude and longitude. However, I am finding this to be very slow. As an example, it takes 0.01 seconds to read 10 values over time from a point location, although I can read a global slice of 10,000 values from a single time point in 0.002 seconds (the order of the dimensions is lat, lon, time):

## a time series of 10 points from one location:
library(ncdf4)
met.nc <- nc_open("1901.nc")
system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100,100,1), 
                                             count = c(1,1,10)))
   user  system elapsed 
  0.001   0.000   0.090 

## close down session

## a global slice of 10k points from one time
library(ncdf4)
system.time(met.nc <- nc_open("1901.nc"))
system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100,100,1), 
                                             count = c(100,100,1)))
   user  system elapsed 
  0.002   0.000   0.002 

I suspect that these files have been written to optimize reading of spatial layers because a) the order of variables is lat, lon, time, b) that would be the logical order for the climate models that generated these files and c) because global extents are the most common visualization.

I have attempted to reorder variables so that time comes first:

ncpdq -a time,lon,lat 1901.nc 1901_time.nc

(ncpdq is from the NCO (netCDF operators) software)
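A quick way to confirm that the permutation took effect is to dump the header of the new file (a sketch; it assumes the standard `ncdump` utility that ships with netCDF and the file name used above):

```shell
# The variable's dimension list should now read (time, lon, lat)
# instead of (lat, lon, time).
ncdump -h 1901_time.nc | head -n 20
```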

> library(ncdf4)

## first with the original data set:
> system.time(met.nc <- nc_open("test/1901.nc"))
   user  system elapsed 
  0.024   0.045  22.334 
> system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100,100,1), count = c(1, 1, 1000))
+ )
   user  system elapsed 
  0.005   0.027  14.958 

## now with the rearranged dimensions:
> system.time(met_time.nc <- nc_open("test/1901_time.nc"))
   user  system elapsed 
  0.025   0.041  16.704 
> system.time(a <- ncvar_get(met_time.nc, "lwdown", start = c(100,100,1), count = c(1, 1, 1000)))
   user  system elapsed 
  0.001   0.019   9.660 

How can I optimize reading time series at a point rather than layers of large areas at one time point? For example, would it be faster if the files were written differently, such as time, lat, lon? Is there an "easy" way to transform the order of dimensions in a netCDF-4 file?

Update

(benchmarks requested by @mdsumner)

library(rbenchmark)
library(ncdf4)
nc <- nc_open("1901.nc")
benchmark(timeseries = ncvar_get(nc, "lwdown", 
                                 start = c(1, 1, 50), 
                                 count = c(10, 10, 100)), 
          spacechunk = ncvar_get(nc, "lwdown", 
                                  start = c(1, 1, 50), 
                                  count = c(100, 100, 1)),           
          replications = 1000)

        test replications elapsed relative user.self sys.self user.child
2 spacechunk         1000   0.909    1.000     0.843    0.066          0
1 timeseries         1000   2.211    2.432     1.103    1.105          0
  sys.child
2         0
1         0

Update 2:

I have started developing a solution here. The bits and pieces are in a set of scripts in github.com/ebimodeling/model-drivers/tree/master/met/cruncep

The scripts still need some work and organization - not all of the scripts are useful. But the reads are lightning quick. Not exactly comparable to the above results, but at the end of the day, I can read a 100 year, six-hourly time series from a 1.3 TB file (0.5 degree resolution) in about 2.5 s:

system.time(ts <- ncvar_get(met.nc, "lwdown", start = c(50, 1, 1), count = c(160000, 1, 1)))
   user  system elapsed 
  0.004   0.000   0.004 

(note: The order of dimensions has changed, as described here: How can I specify dimension order when using ncdf4::ncvar_get?)

Solution

I think the answer to this problem won't be so much re-ordering the data as it will be chunking the data. For a full discussion of the implications of chunking netCDF files, see the following blog posts from Russ Rew, lead netCDF developer at Unidata:

- Chunking Data: Why it Matters (http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters)
- Chunking Data: Choosing Shapes

The upshot is that while employing different chunking strategies can achieve large increases in access speed, choosing the right strategy is non-trivial.
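To see why chunking dominates, it helps to count how many chunks a given read must touch, since each touched chunk is a separate read from disk. A minimal sketch (the chunk shapes below are illustrative, not the actual layout of your file):

```shell
# Number of chunks intersected by a hyperslab read of shape
# (count_lat, count_lon, count_time) given chunks of shape
# (chunk_lat, chunk_lon, chunk_time): ceil(count/chunk) per dimension.
chunks_touched() {
  echo $(( (($1 + $4 - 1) / $4) * (($2 + $5 - 1) / $5) * (($3 + $6 - 1) / $6) ))
}

# Layout contiguous in lat-lon (each "chunk" is one full 360x720 time step):
# a 1000-step point time series touches 1000 separate chunks.
chunks_touched 1 1 1000  360 720 1      # -> 1000

# With 100x100x100 chunks, the same read touches only 10 chunks.
chunks_touched 1 1 1000  100 100 100    # -> 10
```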

On the smaller sample dataset, sst.wkmean.1990-present.nc, I saw the following results when using your benchmark command:

1) Unchunked:

## test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk         1000   0.841    1.000     0.812    0.029          0         0
## 1 timeseries         1000   1.325    1.576     0.944    0.381          0         0

2) Naively Chunked:

## test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk         1000   0.788    1.000     0.788    0.000          0         0
## 1 timeseries         1000   0.814    1.033     0.814    0.001          0         0

The naive chunking was simply a shot in the dark; I used the nccopy utility thusly:

$ nccopy -c"lat/100,lon/100,time/100,nbnds/" sst.wkmean.1990-present.nc chunked.nc

The Unidata documentation for the nccopy utility can be found here.
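For a layout biased toward point time series, one could instead give the time dimension long chunks and the spatial dimensions short ones. A sketch for the 1901 file (the chunk sizes and output file name are guesses to experiment with, not a recommendation):

```shell
# Long chunks along time, small chunks in space, so a point time series
# lives in few chunks. Adjust sizes to your access pattern and re-benchmark.
nccopy -c "time/1420,lat/10,lon/10" 1901.nc 1901_tschunked.nc
```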

I wish I could recommend a particular strategy for chunking your data set, but it is highly dependent on the data. Hopefully the articles linked above will give you some insight into how you might chunk your data to achieve the results you're looking for!

Update

The following blog post by Marcos Hermida shows how different chunking strategies influenced the speed when reading a time series from a particular netCDF file. It should be used only as a jumping-off point.

Regarding rechunking via nccopy apparently hanging: the issue appears to be related to the default chunk cache size of 4 MB. By increasing that to 4 GB (or more), you can reduce the copy time for a large file from over 24 hours to under 11 minutes!
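On the command line, the cache and buffer sizes can be raised alongside the chunking specification. A sketch, assuming a reasonably recent nccopy (the size values and the 110-year file name here are assumptions, not tested settings):

```shell
# -h: chunk cache size, -e: number of chunk cache elements, -m: copy buffer
# size. A large cache avoids re-reading source chunks while the layout is
# being rewritten.
nccopy -m 1G -h 4G -e 65537 -c "time/1420,lat/10,lon/10" all_years.nc all_years_chunked.nc
```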

One point I'm not sure about: in the first link, the discussion is in regards to the chunk cache, but the argument passed to nccopy, -m, specifies the number of bytes in the copy buffer; the chunk cache size is controlled separately, by nccopy's -h argument.
