How to make use of the filesystem cache in Java or Python?


Question

A recent blog post on the Elasticsearch website describes the features of their new 1.4 beta release.

I am very curious about how they make use of the filesystem cache:


Recent releases have added support for doc values. Essentially, doc values provide the same function as in-memory fielddata, but they are written to disk at index time. The benefit that they provide is that they consume very little heap space. Doc values are read from disk, instead of from memory. While disk access is slow, doc values benefit from the kernel’s filesystem cache. The filesystem cache, unlike the JVM heap, is not constrained by the 32GB limit. By shifting fielddata from the heap to the filesystem cache, you can use smaller heaps which means faster garbage collections and thus more stable nodes.

Before this release, doc values were significantly slower than in-memory fielddata. The changes in this release have improved the performance significantly, making them almost as fast as in-memory fielddata.



Does this mean that we can influence the behavior of the filesystem cache rather than passively waiting for the OS to act? If so, how can we make use of the filesystem cache in normal application development? Say I'm writing a Python or Java program: how can I do this?

Recommended answer

The file-system cache is an implementation detail of the operating system that is transparent to the end user. It isn't something that needs adjusting or changing. Lucene already makes use of the file-system cache when it manages index segments. Every time something is indexed into Lucene (via Elasticsearch), the documents are written to segments, which are first written to the file-system cache and then, after some time (for example, when the translog, a way of keeping track of the documents being indexed, is full), the content of the cache is written to an actual file. While the documents being indexed are still in the file-system cache, they can already be accessed.
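The write path described above can be observed from an ordinary program. The following is a minimal Python sketch of general OS behavior, not of Lucene's or Elasticsearch's actual code: the function name and the scratch file are hypothetical. It writes some bytes, reads them back through a second handle before asking the kernel to persist them, and only then calls `os.fsync`.

```python
import os
import tempfile


def write_then_read_before_sync(payload: bytes) -> bytes:
    """Write payload, read it back before forcing it to disk, then fsync."""
    fd, path = tempfile.mkstemp()  # hypothetical scratch file
    try:
        with os.fdopen(fd, "wb") as writer:
            writer.write(payload)
            writer.flush()  # hand the bytes from Python's buffer to the OS
            # The kernel now holds the data (typically in its file-system
            # cache); it is not necessarily on the physical disk yet.
            with open(path, "rb") as reader:
                seen = reader.read()  # a second handle already sees the data
            os.fsync(writer.fileno())  # ask the kernel to persist it now
        return seen
    finally:
        os.remove(path)


print(write_then_read_before_sync(b"doc-1"))  # b'doc-1'
```

The program never addresses the cache directly; it only decides when to flush and when to sync, and the kernel does the rest.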

The improvement in the doc values implementation refers to exactly this: doc values can now take advantage of the file-system cache, because they are read from disk, placed in the cache, and accessed from there instead of taking up heap space.

How this file-system cache is accessed is described in this excellent blog post:


In our previous approaches, we were relying on using a syscall to copy the data between the file system cache and our local Java heap. How about directly accessing the file system cache? This is what mmap does!

Basically mmap does the same like handling the Lucene index as a swap file. The mmap() syscall tells the O/S kernel to virtually map our whole index files into the previously described virtual address space, and make them look like RAM available to our Lucene process. We can then access our index file on disk just like it would be a large byte[] array (in Java this is encapsulated by a ByteBuffer interface to make it safe for use by Java code). If we access this virtual address space from the Lucene code we don’t need to do any syscalls, the processor’s MMU and TLB handles all the mapping for us. If the data is only on disk, the MMU will cause an interrupt and the O/S kernel will load the data into file system cache. If it is already in cache, MMU/TLB map it directly to the physical memory in file system cache.

As for actually using mmap in a Java program, I think this is the class and method to do so.
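The question also mentions Python, where the same idea is available through the standard `mmap` module. (In Java, memory mapping is exposed by `FileChannel.map`, which returns a `MappedByteBuffer`; whether that is what the answer linked to is an assumption.) The sketch below is illustrative only: the helper name `read_slice_via_mmap` and the demo file contents are made up, and the file stands in for an index file.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path: str, start: int, length: int) -> bytes:
    """Map a file read-only and return a slice without an explicit read()."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing touches the mapped pages; the kernel serves them from
            # the file-system cache, loading from disk on a page fault.
            return mapped[start:start + length]


# Hypothetical stand-in for an index file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"segment-0:hello filesystem cache")

print(read_slice_via_mmap(demo_path, 10, 5))  # b'hello'
os.remove(demo_path)
```

Note that the program still does not control the cache: it only maps the file, and the OS decides which pages stay in memory.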
