How do I combine multiple datasets (.h5 files) with different dimension sizes in xarray?

Question

I have tried several methods to build an xarray (xr) dataset from multiple .h5 files. The files contain data from the SMAP project on soil moisture content, along with other useful variables. Each variable is a 2-D array. The number of variables and their labels are the same in every file. The problem is that the sizes of dimensions x and y differ between files.

Example dataset loaded via xr.open_dataset():

<xarray.Dataset>
Dimensions:                                     (x: 54, y: 129)
Coordinates:
    EASE_column_index_3km                       (x, y) float32 ...
    EASE_column_index_apm_3km                   (x, y) float32 ...
    EASE_row_index_3km                          (x, y) float32 ...
    EASE_row_index_apm_3km                      (x, y) float32 ...
    latitude_3km                                (x, y) float32 ...
    latitude_apm_3km                            (x, y) float32 ...
    longitude_3km                               (x, y) float32 ...
    longitude_apm_3km                           (x, y) float32 ...
Dimensions without coordinates: x, y
Data variables:
    SMAP_Sentinel_overpass_timediff_hr_3km      (x, y) timedelta64[ns] ...
    SMAP_Sentinel_overpass_timediff_hr_apm_3km  (x, y) timedelta64[ns] ...
    albedo_3km                                  (x, y) float32 ...
    albedo_apm_3km                              (x, y) float32 ...
    bare_soil_roughness_retrieved_3km           (x, y) float32 ...
    bare_soil_roughness_retrieved_apm_3km       (x, y) float32 ...
    beta_tbv_vv_3km                             (x, y) float32 ...
    beta_tbv_vv_apm_3km                         (x, y) float32 ...
    disagg_soil_moisture_3km                    (x, y) float32 ...
    disagg_soil_moisture_apm_3km                (x, y) float32 ...
    disaggregated_tb_v_qual_flag_3km            (x, y) float32 ...
    disaggregated_tb_v_qual_flag_apm_3km        (x, y) float32 ...
    gamma_vv_xpol_3km                           (x, y) float32 ...
    gamma_vv_xpol_apm_3km                       (x, y) float32 ...
    landcover_class_3km                         (x, y) float32 ...
    landcover_class_apm_3km                     (x, y) float32 ...
    retrieval_qual_flag_3km                     (x, y) float32 ...
    retrieval_qual_flag_apm_3km                 (x, y) float32 ...
    sigma0_incidence_angle_3km                  (x, y) float32 ...
    sigma0_incidence_angle_apm_3km              (x, y) float32 ...
    sigma0_vh_aggregated_3km                    (x, y) float32 ...
    sigma0_vh_aggregated_apm_3km                (x, y) float32 ...
    sigma0_vv_aggregated_3km                    (x, y) float32 ...
    sigma0_vv_aggregated_apm_3km                (x, y) float32 ...
    soil_moisture_3km                           (x, y) float32 ...
    soil_moisture_apm_3km                       (x, y) float32 ...
    soil_moisture_std_dev_3km                   (x, y) float32 ...
    soil_moisture_std_dev_apm_3km               (x, y) float32 ...
    spacecraft_overpass_time_seconds_3km        (x, y) timedelta64[ns] ...
    spacecraft_overpass_time_seconds_apm_3km    (x, y) timedelta64[ns] ...
    surface_flag_3km                            (x, y) float32 ...
    surface_flag_apm_3km                        (x, y) float32 ...
    surface_temperature_3km                     (x, y) float32 ...
    surface_temperature_apm_3km                 (x, y) float32 ...
    tb_v_disaggregated_3km                      (x, y) float32 ...
    tb_v_disaggregated_apm_3km                  (x, y) float32 ...
    tb_v_disaggregated_std_3km                  (x, y) float32 ...
    tb_v_disaggregated_std_apm_3km              (x, y) float32 ...
    vegetation_opacity_3km                      (x, y) float32 ...
    vegetation_opacity_apm_3km                  (x, y) float32 ...
    vegetation_water_content_3km                (x, y) float32 ...
    vegetation_water_content_apm_3km            (x, y) float32 ...
    water_body_fraction_3km                     (x, y) float32 ...
    water_body_fraction_apm_3km                 (x, y) float32 ...

Example variable, dataset.soil_moisture_3km:

<xarray.DataArray 'soil_moisture_3km' (x: 54, y: 129)>
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=float32)
Coordinates:
    EASE_column_index_3km      (x, y) float32 ...
    EASE_column_index_apm_3km  (x, y) float32 ...
    EASE_row_index_3km         (x, y) float32 ...
    EASE_row_index_apm_3km     (x, y) float32 ...
    latitude_3km               (x, y) float32 ...
    latitude_apm_3km           (x, y) float32 ...
    longitude_3km              (x, y) float32 ...
    longitude_apm_3km          (x, y) float32 ...
Dimensions without coordinates: x, y
Attributes:
    units:        cm**3/cm**3
    valid_min:    0.0
    long_name:    Representative soil moisture measurement for the 3 km Earth...
    coordinates:  /Soil_Moisture_Retrieval_Data_3km/latitude_3km /Soil_Moistu...
    valid_max:    0.75

First I tried to open the files with:

test = xr.open_mfdataset(list_of_paths)

This raised the following error:

ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {129, 132}

Then I tried combining by coordinates:

test = xr.open_mfdataset(list_of_paths, combine='by_coords')

which produced this error:

ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation

Trying this instead:

test = xr.open_mfdataset(list_of_paths, coords=['latitude_3km', 'longitude_3km'], combine='by_coords')

ended with the same error.

Then I opened every file with xr.open_dataset() and tried every method I could find on the documentation page for combining data, such as merge, combine, broadcast_like, and align and combine, but every time I ended up with the same problem: the dimensions are not equal. What is the common approach to reshaping or aligning the dimensions, or whatever else is possible, to solve this problem?

UPDATE:
I found a workaround for my problem, but first I should mention something I forgot: the files I am trying to concatenate along the time dimension have different coordinates and dimensions. The images I want to build my model from all have overlapping areas with the same longitude and latitude values, but also parts that do not overlap.

Recommended answer

"The count of variables and their labels are equal in every file. The problem is that the sizes of dimensions x and y are not equal."

Sorry, is len(x) the same in every file? And is len(y) the same? If not, open_mfdataset can't handle this directly.
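One way to answer that question quickly is to collect the length of every dimension across the opened files. The following is only a sketch: the helper name collect_dim_sizes and the two small in-memory datasets are made up for illustration and stand in for datasets opened with xr.open_dataset().

```python
import numpy as np
import xarray as xr


def collect_dim_sizes(datasets):
    """Map each dimension name to the set of lengths seen across datasets."""
    sizes = {}
    for ds in datasets:
        for dim, length in ds.sizes.items():
            sizes.setdefault(dim, set()).add(length)
    return sizes


# Hypothetical stand-ins for two opened files; the shapes are illustrative only.
tile_a = xr.Dataset({"soil_moisture_3km": (("x", "y"), np.zeros((4, 5), dtype="float32"))})
tile_b = xr.Dataset({"soil_moisture_3km": (("x", "y"), np.zeros((4, 6), dtype="float32"))})

dim_sizes = collect_dim_sizes([tile_a, tile_b])

# Any dimension with more than one length cannot be aligned without labels.
mismatched = {dim: lengths for dim, lengths in dim_sizes.items() if len(lengths) > 1}
print(mismatched)
```

If mismatched comes back empty, the files at least agree on their dimension lengths.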

If they are the same, you should in theory be able to do this in two different ways.

In that case you have a 2D concatenation problem: you need to arrange the datasets so that, when joined up along x and y, they form a larger dataset that also has dimensions x and y.

1) Using combine='nested'

You can manually specify the order in which the datasets should be joined. xarray lets you do this by passing the datasets as a grid, specified as a nested list. In your case, if we had 4 files (named [upper_left, upper_right, lower_left, lower_right]), we would combine them like so:

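from xarray import open_mfdataset

# upper_left, upper_right, lower_left and lower_right are the paths of the four files.
# The outer list is joined along 'x' and each inner list along 'y'.
grid = [[upper_left, upper_right],
        [lower_left, lower_right]]

ds = open_mfdataset(grid, concat_dim=['x', 'y'], combine='nested')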

We had to tell open_mfdataset which dimensions of the data the rows and columns of the grid correspond to, so that it knows which dimensions to concatenate the data along. That's why we needed to pass concat_dim=['x', 'y'].
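To see the same arrangement without any files on disk, here is a small sketch using xr.combine_nested, the in-memory counterpart of combine='nested'. The make_tile helper, the tile shapes, and the fill values are invented for illustration.

```python
import numpy as np
import xarray as xr


def make_tile(fill_value, shape=(2, 3)):
    """Build a small tile with dimensions (x, y) and no dimension coordinates."""
    data = np.full(shape, fill_value, dtype="float32")
    return xr.Dataset({"soil_moisture_3km": (("x", "y"), data)})


# Four hypothetical tiles, each filled with a different constant.
upper_left, upper_right = make_tile(1.0), make_tile(2.0)
lower_left, lower_right = make_tile(3.0), make_tile(4.0)

# The outer list is joined along 'x', each inner list along 'y'.
grid = [[upper_left, upper_right],
        [lower_left, lower_right]]

combined = xr.combine_nested(grid, concat_dim=["x", "y"])
print(combined.sizes)
```

Note that this only works when neighbouring tiles agree on the length of the edge they share, which is exactly the condition asked about above.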

2) Using combine='by_coords'

But your data already has coordinates in it, so can't xarray just use those to arrange the datasets in the right order? That is what the combine='by_coords' option is for, but unfortunately it requires 1-dimensional coordinates (also known as dimension coordinates) to arrange the data. Your files don't have any of those (that's why the printout says Dimensions without coordinates: x, y).

If you can add 1-dimensional coordinates to your files first, then you can use combine='by_coords' and simply pass a list of all the files in any order. Otherwise you'll have to use combine='nested' in this case.
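As a rough illustration of the first option, the sketch below attaches 1-D x and y coordinates to four in-memory tiles and lets xr.combine_by_coords work out the arrangement. The coordinate values here are invented; how to derive real 1-D coordinates for the SMAP files (for example from the EASE row and column indices) is an assumption that would need checking against the data.

```python
import numpy as np
import xarray as xr


def make_tile(x_values, y_values, fill_value):
    """Build a tile and attach 1-D (dimension) coordinates to x and y."""
    data = np.full((len(x_values), len(y_values)), fill_value, dtype="float32")
    ds = xr.Dataset({"soil_moisture_3km": (("x", "y"), data)})
    return ds.assign_coords(x=x_values, y=y_values)


# Hypothetical tiles covering a 4 x 6 area without overlap.
upper_left = make_tile([0, 1], [0, 1, 2], 1.0)
upper_right = make_tile([0, 1], [3, 4, 5], 2.0)
lower_left = make_tile([2, 3], [0, 1, 2], 3.0)
lower_right = make_tile([2, 3], [3, 4, 5], 4.0)

# The order of the list no longer matters: the coordinates decide the layout.
tiles = [lower_right, upper_left, lower_left, upper_right]
combined = xr.combine_by_coords(tiles)
print(combined.sizes)
```

With real files, the preprocess argument of open_mfdataset accepts a function that is applied to each dataset as it is opened, which could be one place to attach such coordinates.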

(You don't need the coords argument here. It controls how different coordinates are joined up, not how the datasets are arranged.)
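For the situation described in the question's update, where scenes overlap only partly and are to be stacked along time, one possible approach (not necessarily the workaround the asker found) is xr.concat with join='outer'. It again assumes each scene already carries 1-D x and y coordinates; the make_scene helper and all values below are hypothetical.

```python
import numpy as np
import xarray as xr


def make_scene(x_values, y_values, fill_value):
    """Build a scene with 1-D x and y coordinates (illustrative values only)."""
    data = np.full((len(x_values), len(y_values)), fill_value, dtype="float32")
    return xr.Dataset(
        {"soil_moisture_3km": (("x", "y"), data)},
        coords={"x": x_values, "y": y_values},
    )


# Two scenes with different extents that overlap only partly.
scene_a = make_scene([0, 1], [0, 1, 2], 0.1)
scene_b = make_scene([1, 2], [1, 2, 3], 0.2)

# join='outer' takes the union of x and y and fills uncovered cells with NaN.
stacked = xr.concat([scene_a, scene_b], dim="time", join="outer")
print(stacked.sizes)
```

Each time slice then shares one common grid, with NaN wherever a scene has no data.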
