Java Performance: multiple array vs single array of custom object


Problem Description


Which of these two code solutions gives the best performance?

class Meme {
  private int[] a = new int[100];
  private int[] b = new int[100];
  private int[] c = new int[100];
}

vs

class Meme {
   private MemeProp[] prop = new MemeProp[100];
   class MemeProp {
     int a;
     int b;
     int c;
   }
}

Consider continuous access for reading and writing the properties a, b, c.

I need to write code for fast execution, not for memory optimization. Therefore, my performance benchmark is execution time.

Solution

It depends a lot on your memory access patterns.

The first one is definitely more compact.

User-defined types in Java carry some overhead, something like a pointer overhead per object (8 bytes on 64-bit). An Integer can take 16 bytes (8 bytes for object + 4 bytes for int + 4 for alignment), e.g., while an int takes a mere 4 bytes. It's analogous to a class with virtual functions in C++ storing a vptr.

Given this, if we look at the memory usage of MemeProp, we have something like this:

  class MemeProp {
     // invisible 8 byte pointer with 8-byte alignment requirements
     int a; // 4 bytes
     int b; // 4 bytes
     int c; // 4 bytes
     // 4 bytes of padding for alignment of invisible field
   }

The resulting memory size is 24 bytes per MemeProp instance. When we take a hundred of those, we end up with a total memory usage of 2400 bytes.

Meanwhile, your 3 arrays each containing a hundred ints will only require slightly over 1200 bytes (a little teeny bit extra for array overhead storing a length and pointer). That's very close to half the size of your second version.
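The arithmetic above can be checked in a few lines. This is a back-of-envelope sketch using the answer's own figures (an assumed 8-byte object header and an assumed 16 bytes of per-array overhead; real JVM headers are often 12-16 bytes, so treat these as illustrative constants, not measured values):

```java
// Footprint estimate using the answer's assumed overheads (not measured).
public class FootprintEstimate {
    public static void main(String[] args) {
        int n = 100;
        // AoS: per MemeProp = 8-byte header + 3 ints + 4 bytes padding
        int aosBytes = n * (8 + 3 * 4 + 4);
        // SoA: 3 arrays, each with ~16 bytes of overhead + 100 ints
        int soaBytes = 3 * (16 + n * 4);
        System.out.println(aosBytes); // 2400
        System.out.println(soaBytes); // 1248
    }
}
```

This matches the "2400 bytes" vs. "slightly over 1200 bytes" comparison in the text.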

Sequential Access

When you process data sequentially, speed and size often tend to go hand-in-hand. If more data can fit into a page and cache line, your code will typically consume it much faster in cases where the machine instructions don't vary much between the larger vs. tighter representation.

So from a sequential access standpoint, your first version which requires half the memory is likely to go quite a bit faster (maybe almost twice as fast in some cases).
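As a minimal sketch of what "sequential access" means for each layout (class and method names here are mine, not from the question), the first version streams linearly through flat arrays while the second chases a reference per element:

```java
// Sketch: sequential traversal over both layouts (illustrative names).
public class SequentialDemo {
    static int sumSoA(int[] a, int[] b, int[] c) {
        int s = 0;
        // One linear pass per array: contiguous, cache-friendly reads.
        for (int i = 0; i < a.length; i++) s += a[i] + b[i] + c[i];
        return s;
    }

    static final class Prop { int a, b, c; }

    static int sumAoS(Prop[] prop) {
        int s = 0;
        // One pointer dereference per element before the fields are reached.
        for (Prop p : prop) s += p.a + p.b + p.c;
        return s;
    }

    public static void main(String[] args) {
        int n = 100;
        int[] a = new int[n], b = new int[n], c = new int[n];
        Prop[] prop = new Prop[n];
        for (int i = 0; i < n; i++) {
            a[i] = i; b[i] = 2 * i; c[i] = 3 * i;
            prop[i] = new Prop();
            prop[i].a = i; prop[i].b = 2 * i; prop[i].c = 3 * i;
        }
        System.out.println(sumSoA(a, b, c)); // 29700
        System.out.println(sumAoS(prop));    // 29700
    }
}
```

Both loops compute the same result; the difference is purely in how the bytes are laid out and fetched.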

Random Access

Yet random access is a different case. Let's say a, b, and c are equally hot fields, always accessed together in your tight loops which have a random access pattern into this structure.

In that case, your second version may actually fare better. It's because it offers a contiguous layout for a MemeProp object where a, b, and c will end up being right next to each other in memory, always (no matter how the garbage collector ends up rearranging the memory layout for a MemeProp instance).

With your first version, your a, b, and c arrays are spread out in memory. The stride between them can never be smaller than 400 bytes. This equates to potentially a lot more cache misses if you end up accessing some random element, say the 66th element, where we access a[65], b[65], and c[65]. If that's the first time we're accessing those fields, we'll end up with 3 cache misses. Then we might access a[7], b[7], and c[7], each of which sits 232 bytes away from a[65], b[65], and c[65] respectively, and we might end up with 3 more cache misses.

Possibly Better than Both

Let's say you need random AoS-style access and all fields are always accessed together. In that case, the most optimal representation will likely be this:

class Meme {
    private int[] abc = new int[100 * 3];
}

This ends up taking the minimum amount of memory of all three solutions and guarantees that abc fields for a single MemeProp are right next to each other.

Of course, there might be some cases where your mileage may vary, but this might be the strongest candidate among these three if you need both random and sequential access.
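With this interleaved layout, element i's three fields live at abc[i*3] through abc[i*3 + 2]. A sketch of the index math (the accessor names and STRIDE constant are mine, for illustration):

```java
// Sketch of index math for the interleaved abc layout (accessor names are mine).
public class InterleavedMeme {
    private final int[] abc;
    private static final int STRIDE = 3; // a, b, c per element

    InterleavedMeme(int n) { abc = new int[n * STRIDE]; }

    int a(int i) { return abc[i * STRIDE]; }     // field a of element i
    int b(int i) { return abc[i * STRIDE + 1]; } // field b of element i
    int c(int i) { return abc[i * STRIDE + 2]; } // field c of element i

    void set(int i, int a, int b, int c) {
        int base = i * STRIDE;
        abc[base] = a; abc[base + 1] = b; abc[base + 2] = c;
    }

    public static void main(String[] args) {
        InterleavedMeme m = new InterleavedMeme(100);
        m.set(65, 7, 8, 9);
        // All three fields of element 65 sit in abc[195..197]: one cache line,
        // no object header, no reference hop.
        System.out.println(m.a(65) + "," + m.b(65) + "," + m.c(65)); // 7,8,9
    }
}
```

Unlike the MemeProp[] version, a random access here touches at most one or two cache lines per element instead of up to three.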

Hot/Cold Field Splitting

For completeness, let's consider a case where your memory access patterns are sequential but not all fields (a/b/c) are accessed together. Instead, you have one performance-critical loop which accesses a and b together, and some non-performance-critical code which accesses only c. In that case, you might get the best results from a representation like this:

class Meme {
    private int[] ab = new int[100 * 2];
    private int[] c = new int[100];
}

This makes it so our data looks like this:

abababab...
ccccc...

... as opposed to this:

abcabcabcabc...

In this case, by hoisting out c and putting it into a separate array, it is no longer interleaved with a and b fields, allowing the computer to "consume" this relevant data (a and b in those performance-critical loops) at a faster rate as it's moving contiguous chunks of this memory into faster but smaller forms of memory (physically-mapped page, CPU cache line).
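A sketch of this hot/cold split in code (names are mine): the performance-critical loop streams through the packed ab array only, so the cold c values never occupy space in the cache lines it pulls in.

```java
// Sketch of hot/cold field splitting (illustrative names).
public class HotColdMeme {
    private final int[] ab; // interleaved hot fields: a0,b0,a1,b1,...
    private final int[] c;  // cold field, stored apart

    HotColdMeme(int n) { ab = new int[n * 2]; c = new int[n]; }

    void set(int i, int a, int b, int cold) {
        ab[i * 2] = a; ab[i * 2 + 1] = b; c[i] = cold;
    }

    // Performance-critical loop: touches only the ab array.
    long hotSum() {
        long s = 0;
        for (int i = 0; i < ab.length; i += 2) s += (long) ab[i] * ab[i + 1];
        return s;
    }

    public static void main(String[] args) {
        HotColdMeme m = new HotColdMeme(100);
        for (int i = 0; i < 100; i++) m.set(i, i, i, -1);
        System.out.println(m.hotSum()); // sum of i*i for i=0..99 = 328350
    }
}
```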

SoA Access Pattern

Finally, let's say you are accessing all fields separately. Each critical loop only accesses a, only accesses b, or only accesses c. In that case, your first representation is likely to be the fastest, especially if your compiler can emit efficient SIMD instructions which can vectorize processing of multiple such fields in parallel.
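A sketch of the kind of single-field loop this favors (method name is mine): a simple, branch-free pass over one contiguous int[], which is the shape HotSpot's JIT is most likely to auto-vectorize. Whether SIMD instructions are actually emitted depends on the JVM and hardware, so treat the vectorization claim as a possibility, not a guarantee.

```java
// Sketch: a per-field loop over the plain int[] from the first version.
public class SoaScale {
    static void scaleA(int[] a, int factor) {
        // Touches only field a, laid out contiguously: SIMD-friendly shape.
        for (int i = 0; i < a.length; i++) a[i] *= factor;
    }

    public static void main(String[] args) {
        int[] a = new int[100];
        for (int i = 0; i < a.length; i++) a[i] = i;
        scaleA(a, 3);
        System.out.println(a[10] + "," + a[99]); // 30,297
    }
}
```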

Relevant Data in Cache Lines

If you find all of this confusing, I don't blame you, but there was something harold, a computer architecture wizard, told me once on this site. He summed it all up in the most elegant way that the goal should be to avoid loading irrelevant data in a cache line which is only going to be loaded and evicted without being used. As much as I had developed some intuition of that over all my profiling sessions, I never found such a concise and elegant way to put it that made sense of all those cache misses.

Our hardware and operating systems want to move memory from bigger but slower forms of memory to smaller but faster forms of memory. When it does that, it tends to "grab memory by the handful". Well, if you are trying to grab M&Ms by the handful from a bowl but you're only interested in eating green M&Ms, it becomes very inefficient to grab a handful of mixed M&Ms only to pick out the green ones and then return all the other ones to the bowl. It becomes so much more efficient in that case if you had a bowl filled with only green M&Ms, and that's kind of the goal when you are trying to settle on an efficient memory layout if I use a very crude but hopefully helpful analogy. If all you want to access in a critical loop are the analogical green M&Ms, don't mix them up (interleave that data) with red ones, blue ones, yellow ones, etc. Instead keep all those green ones right next to each other in memory so that when you grab things by the handful, you're only getting what you want.

Data-Oriented Design

One of the things you are doing correctly if you anticipate a large input, loopy scenario for these MemeProps is designing your outer, public interface at the collection level, at the Meme level and turning MemeProp fields into a private detail.

The most effective strategy before you measure is probably identifying the places where you process data in bulk (though 100 isn't exactly bulk; I'm hoping your actual scenario is much bigger) and designing your public interfaces accordingly.

For example, if you are designing an Image class and performance is a key goal, then you want to avoid exposing a Pixel object which provides operations on a pixel-by-pixel basis. Far better is designing this interface at the bulk Image or Scanline level, allowing a bunch of pixels to be processed in bulk.
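A sketch of what that bulk, scanline-level interface might look like (all names here are hypothetical): callers operate on whole rows, so the pixel representation stays a private detail you are free to change after measuring.

```java
// Sketch of a bulk interface: no per-pixel object is ever exposed.
public class Image {
    private final int[] pixels; // packed representation, a private detail
    private final int width, height;

    Image(int width, int height) {
        this.width = width; this.height = height;
        this.pixels = new int[width * height];
    }

    // Bulk operation: fill one whole scanline at a time.
    void fillRow(int y, int value) {
        int base = y * width;
        for (int x = 0; x < width; x++) pixels[base + x] = value;
    }

    int get(int x, int y) { return pixels[y * width + x]; }

    public static void main(String[] args) {
        Image img = new Image(4, 3);
        img.fillRow(1, 7);
        System.out.println(img.get(0, 1) + "," + img.get(3, 1) + "," + img.get(0, 0)); // 7,7,0
    }
}
```

Because clients only ever see Image-level operations, swapping the packed int[] for a tiled or interleaved layout later breaks no caller.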

That leaves you a lot more wiggle room to measure and tune data representations than a design which has ten thousand client dependencies on some granular Pixel object interface representing a single pixel.

So anyway, safest bet is to measure, but it's good that you are designing at an appropriately-coarse level for your interface designs.

