Is there a way to open hdf5 files with the POSIX_FADV_DONTNEED flag?

Question

We are working with large (1.2TB) uncompressed, unchunked hdf5 files with h5py in Python for a machine learning application, which requires us to work through the full dataset repeatedly, loading slices of ~15MB individually in a randomized order. We are working on a Linux (Ubuntu 18.04) machine with 192 GB RAM. We noticed that the program is slowly filling the cache. When the total size of the cache reaches a size comparable to the machine's full RAM (free memory in top is almost 0, but there is plenty of 'available' memory), swapping occurs, slowing down all other applications. To pinpoint the source of the problem, we wrote a separate minimal example to isolate our data-loading procedures, but found that the problem was independent of each part of our method.

We tried: building a numpy memmap and accessing the requested slice:

import h5py
import numpy as np

# on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
# map the raw contiguous dataset bytes directly, bypassing h5py reads
self.event_data = np.memmap(tv_path, mode="r", shape=hdf5_event_data.shape,
                            offset=hdf5_event_data.id.get_offset(),
                            dtype=hdf5_event_data.dtype)
self.e = np.ones((512, 40, 40, 19))

# on __getitem__:
self.e = self.event_data[index, :, :, :19]
return self.e

Reopening the memmap on each call to __getitem__:

# on __getitem__:
self.event_data = np.memmap(self.path, mode="r", shape=self.shape,
                            offset=self.offset, dtype=self.dtype)
self.e = self.event_data[index, :, :, :19]
return self.e

Addressing the h5 file directly and converting to a numpy array:

# on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
self.event_data = hdf5_event_data
self.e = np.ones((512, 40, 40, 19))

# on __getitem__:
# h5py returns the selected slice as a numpy array
self.e = self.event_data[index, :, :, :19]
return self.e

We also tried the above approaches within the PyTorch Dataset/DataLoader framework, but it made no difference. For reference, a minimal sketch of what such a wrapper might look like is shown below.
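The class name EventDataset is hypothetical; the file path, dataset name, and slicing pattern are taken from the snippets above, and shuffle=True reproduces the randomized access order described:

import h5py
import numpy as np
from torch.utils.data import Dataset, DataLoader

class EventDataset(Dataset):  # hypothetical name
    def __init__(self, tv_path):
        self.f = h5py.File(tv_path, 'r')
        self.event_data = self.f["event_data"]

    def __len__(self):
        # one ~15MB slice per index along the first axis
        return self.event_data.shape[0]

    def __getitem__(self, index):
        # same access pattern as the snippets above
        return np.array(self.event_data[index, :, :, :19])

loader = DataLoader(EventDataset(tv_path), shuffle=True)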

We observe high memory fragmentation, as evidenced by /proc/buddyinfo. Dropping the cache via sync; echo 3 > /proc/sys/vm/drop_caches doesn't help while the application is running. Cleaning the cache before the application starts removes the swapping behaviour until the cache eats up the memory again, and swapping starts again.

Our working hypothesis is that the system is trying to hold on to cached file data, which leads to memory fragmentation. Eventually, when new memory is requested, swapping is performed even though most memory is still 'available'.

As such, we turned to ways to change the Linux environment's behaviour around file caching and found this post. Is there a way to apply the POSIX_FADV_DONTNEED flag when opening an h5 file in Python, or to a portion of the file that we access via a numpy memmap, so that this accumulation of cache does not occur? In our use case we will not be re-visiting that particular file location for a long time (until we have accessed all the other remaining 'slices' of the file).

Answer

You can use os.posix_fadvise to tell the OS how the regions you plan to load will be used. This naturally requires a bit of low-level tweaking to determine your file descriptor and to get an idea of the regions you plan on reading.

The easiest way to get the file descriptor is to supply it yourself:

# open the file yourself so the descriptor stays accessible for fadvise calls
pf = open(tv_path, 'rb')
f = h5py.File(pf, 'r')

You can now set the advice. For the entire file:

os.posix_fadvise(os.fileno(pf), 0, f.id.get_filesize(), os.POSIX_FADV_DONTNEED)

Or for a specific dataset:

os.posix_fadvise(os.fileno(pf), hdf5_event_data.id.get_offset(),
                 hdf5_event_data.id.get_storage_size(), os.POSIX_FADV_DONTNEED)
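
Since the file is uncompressed and unchunked, the dataset is stored contiguously, so the byte range of a single slice can be computed and the advice issued right after each read. A minimal sketch under that assumption; the getitem helper and the row_bytes arithmetic are illustrative, not part of the original answer, and reuse pf and hdf5_event_data from above:

import os
import numpy as np

fd = os.fileno(pf)
base = hdf5_event_data.id.get_offset()  # file offset where the raw data begins
# bytes per index along the first axis, assuming C-order contiguous storage
row_bytes = hdf5_event_data.dtype.itemsize * int(np.prod(hdf5_event_data.shape[1:]))

def getitem(index):
    e = hdf5_event_data[index, :, :, :19]
    # advise the kernel that this slice's pages won't be revisited for a long time
    os.posix_fadvise(fd, base + index * row_bytes, row_bytes, os.POSIX_FADV_DONTNEED)
    return e

Note that POSIX_FADV_DONTNEED only evicts clean page-cache pages, which matches this read-only workload.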

Other things to look at

H5py does its own chunk caching. You may want to try turning this off:

f = h5py.File(..., rdcc_nbytes=0)

As an alternative, you may want to try using one of the other drivers provided in h5py, like 'sec2':

f = h5py.File(..., driver='sec2')
