Faster reading of time series from netCDF?


Question

I have some large netCDF files that contain 6-hourly data for the earth at 0.5 degree resolution.

There are 360 latitude points, 720 longitude points, and 1420 time points per year. I have both yearly files (12 GB each) and one file with 110 years of data (1.3 TB) stored as netCDF-4 (here is an example of the 1901 data, 1901.nc, its use policy, and the original, public files that I started with).

From what I understand, it should be faster to read from one netCDF file than to loop over the set of files, separated by year and variable, that were originally provided.

I want to extract a time series for each grid point, e.g. 10 or 30 years from a specific latitude and longitude. However, I am finding this to be very slow. As an example, it takes me 0.01 seconds to read in 10 values over time from a point location, although I can read a global slice of 10,000 values from a single time point in 0.002 seconds (the order of the dimensions is lat, lon, time):

## a time series of 10 points from one location:
library(ncdf4)
met.nc <- nc_open("1901.nc")
system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100,100,1), 
                                             count = c(1,1,10)))
   user  system elapsed 
  0.001   0.000   0.090 

## close down session

## a global slice of 10k points from one time
library(ncdf4)
system.time(met.nc <- nc_open("1901.nc"))
system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100,100,1), 
                                             count = c(100,100,1)))
   user  system elapsed 
  0.002   0.000   0.002 

I suspect that these files have been written to optimize reading of spatial layers, because a) the order of the dimensions is lat, lon, time, b) that would be the logical order for the climate models that generated these files, and c) global extents are the most common visualization.

I have attempted to reorder the dimensions so that time comes first:

ncpdq -a time,lon,lat 1901.nc 1901_time.nc

(ncpdq is from the NCO (netCDF Operators) package)

> library(ncdf4)

## first with the original data set:
> system.time(met.nc <- nc_open("test/1901.nc"))
   user  system elapsed 
  0.024   0.045  22.334 
> system.time(a <- ncvar_get(met.nc, "lwdown", start = c(100,100,1), count = c(1, 1, 1000))
+ )
   user  system elapsed 
  0.005   0.027  14.958 

## now with the rearranged dimensions:
> system.time(met_time.nc <- nc_open("test/1901_time.nc"))
   user  system elapsed 
  0.025   0.041  16.704 
> system.time(a <- ncvar_get(met_time.nc, "lwdown", start = c(100,100,1), count = c(1, 1, 1000)))
   user  system elapsed 
  0.001   0.019   9.660 

How can I optimize reading a time series at a point rather than layers over large areas at one time point? For example, would it be faster if the files were written differently, such as time, lat, lon? Is there an "easy" way to transform the order of dimensions in a netCDF-4 file?

(benchmarks requested by @mdsumner)

library(rbenchmark)
library(ncdf4)
nc <- nc_open("1901.nc")
benchmark(timeseries = ncvar_get(nc, "lwdown", 
                                 start = c(1, 1, 50), 
                                 count = c(10, 10, 100)), 
          spacechunk = ncvar_get(nc, "lwdown", 
                                  start = c(1, 1, 50), 
                                  count = c(100, 100, 1)),           
          replications = 1000)

        test replications elapsed relative user.self sys.self user.child sys.child
2 spacechunk         1000   0.909    1.000     0.843    0.066          0         0
1 timeseries         1000   2.211    2.432     1.103    1.105          0         0

Update 2:

I have started to develop a solution. The bits and pieces are in a set of scripts at github.com/ebimodeling/model-drivers/tree/master/met/cruncep.

The scripts still need some work and organization - not all of the scripts are useful. But the reads are lightning quick. Not exactly comparable to the above results, but at the end of the day, I can read a 100-year, six-hourly time series from the 1.3 TB file (0.5 degree resolution) almost instantly, about 2.5 s in total:

system.time(ts <- ncvar_get(met.nc, "lwdown", start = c(50, 1, 1), count = c(160000, 1, 1)))
   user  system elapsed 
  0.004   0.000   0.004 

(note: the order of the dimensions has changed, as described here: How can I specify dimension order when using ncdf4::ncvar_get?)
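
For illustration, a small helper along these lines (a sketch, not one of the scripts in the repository above) can map a target latitude and longitude to grid indices before extracting the series. It assumes coordinate variables named "lat" and "lon", a (time, lat, lon) dimension order as seen from ncvar_get, and a hypothetical file name; adjust these to match your data.

library(ncdf4)

## Sketch only: assumes coordinate variables "lat"/"lon", dimension order
## (time, lat, lon) as seen by ncvar_get, and a hypothetical rechunked file.
get_point_series <- function(nc, varname, lat, lon, nsteps) {
  lats <- ncvar_get(nc, "lat")
  lons <- ncvar_get(nc, "lon")
  i <- which.min(abs(lats - lat))   # nearest latitude index
  j <- which.min(abs(lons - lon))   # nearest longitude index
  ncvar_get(nc, varname, start = c(1, i, j), count = c(nsteps, 1, 1))
}

met.nc <- nc_open("all_years_rechunked.nc")   # hypothetical file name
lwdown <- get_point_series(met.nc, "lwdown", lat = 45.25, lon = 100.25, nsteps = 160000)
nc_close(met.nc)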

Answer

I think the answer to this problem won't be so much re-ordering the data as it will be chunking the data. For a full discussion of the implications of chunking netCDF files, see the blog posts on chunking ("Chunking Data: Why it Matters" and "Chunking Data: Choosing Shapes") by Russ Rew, lead netCDF developer at Unidata.

The upshot is that while different chunking strategies can yield large increases in access speed, choosing the right strategy is non-trivial.
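
Before choosing a strategy, it helps to check how a file is chunked now. Assuming the netCDF command-line utilities are installed, ncdump's -s flag prints the special virtual attributes (such as _Storage and _ChunkSizes) along with the header; called from R:

## Show storage layout and chunk sizes for every variable in the file;
## "ncdump -h" prints only the header, "-s" adds the special attributes.
system("ncdump -hs 1901.nc | grep -E '_Storage|_ChunkSizes'")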

On the smaller sample dataset, sst.wkmean.1990-present.nc, I saw the following results when using your benchmark command:

1) Unchunked:

## test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk         1000   0.841    1.000     0.812    0.029          0         0
## 1 timeseries         1000   1.325    1.576     0.944    0.381          0         0

2) Naively chunked:

## test replications elapsed relative user.self sys.self user.child sys.child
## 2 spacechunk         1000   0.788    1.000     0.788    0.000          0         0
## 1 timeseries         1000   0.814    1.033     0.814    0.001          0         0

The naive chunking was simply a shot in the dark; I used the nccopy utility as follows:

$ nccopy -c"lat/100,lon/100,time/100,nbnds/" sst.wkmean.1990-present.nc chunked.nc

The nccopy utility and its chunking options are documented by Unidata.
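
As a rough sanity check on why 100 x 100 x 100 chunks help point reads (assuming 4-byte values; the actual element size depends on the variable's type and any compression):

## Approximate size of one 100 x 100 x 100 chunk, assuming 4 bytes per value:
100 * 100 * 100 * 4 / 2^20      # ~3.8 MB per chunk

## A 100-step time series at a single point now touches one such chunk,
## instead of data scattered across 100 full spatial slices of the variable.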

I wish I could recommend a particular strategy for chunking your data set, but it is highly dependent on the data. Hopefully the articles linked above will give you some insight into how you might chunk your data to achieve the results you're looking for!

A blog post by Marcos Hermida shows how different chunking strategies influenced the speed when reading a time series from a particular netCDF file. It should be treated only as a jumping-off point.

Regarding rechunking via nccopy apparently hanging: the issue appears to be related to the default chunk cache size of 4 MB. By increasing that to 4 GB (or more), you can reduce the copy time for a large file from over 24 hours to under 11 minutes!

One point I'm not sure about: in the first link, the discussion is about the chunk cache, but the -m argument passed to nccopy specifies the number of bytes in the copy buffer; the chunk cache size is controlled by a separate nccopy option (-h).
