Array/Linked list: performance depends on the *direction* of traversal?

Question

This post is divided into two major sections. The first section introduces the original test cases and results, along with my thoughts about them. The second section details the modified test case and its results.

The original title of this topic was "Full iteration over an array significantly faster than with a linked list". The title was changed due to the newer test results (presented in Section Two).

For a full one-directional sequential traversal, it's known that the linked list and the array have similar performance, but due to the cache-friendliness (locality of reference) of the contiguous array, it may perform (slightly) better. To see how this works in practice (Android, Java), I examined the above claims and made some measurements.

First of all, my naive assumptions. Let's take a look at the following class:

private static class Message {
    public int value1;
    public Message next;

    public Message(java.util.Random r, Message nextmsg) {
        value1 = r.nextInt();
        next = nextmsg;
    }
}

In the first measurement scenario, its next field will not be used at all. The code below creates an array of 1,000,000 Message instances, then iterates over the array in a loop and measures how much time the iteration takes.

Log.i("TEST", "Preparing...");          

final int cnt = 1000000;
int val = 0;
java.util.Random r = new java.util.Random();
Message[] messages = new Message[cnt];
for (int i = 0; i < cnt; i++) {
    messages[i] = new Message(r, null);
}           

Log.i("TEST", "Starting iterating...");
long start = SystemClock.uptimeMillis();

for (int i = 0; i < cnt; i++) {
    Message msg = messages[i];
    if (msg.value1 > 564645) {
        val++;
    }
}       
Log.w("TEST", "Time: " + (SystemClock.uptimeMillis() - start) + ", result: " + val);

The second measurement builds and measures a linked list of Message objects instead:

Log.i("TEST", "Preparing...");          

final int cnt = 1000000;
int val = 0;
java.util.Random r = new java.util.Random();
Message current = null;
Message previous = null;
for (int i = 0; i < cnt; i++) {
    current = new Message(r, previous);
    previous = current;
}
previous = null;

Log.i("TEST", "Starting iterating...");
long start = SystemClock.uptimeMillis();
while (current != null) {
    if (current.value1 > 564645) {
        val++;
    }
    current = current.next;
}           

Log.w("TEST","Time: " + (SystemClock.uptimeMillis() - start) + ", result: " + val);

The first test consistently produces 41-44 ms, while the second test gives 80-85 ms. The linked list iteration seems to be 100% slower.

My (possibly flawed) train of thought is as follows. I welcome (in fact, encourage) any corrections.

OK, we can often read that an array is a contiguous memory block, and thus accessing its elements sequentially is more cache-friendly than in the case of a linked list. But in our case, the elements of the array are only object references, not the Message objects themselves (in Java, we don't have value types, i.e. structs as in C#, that we could store embedded in an array). Therefore, the "locality of reference" only applies to the array elements themselves, and these only specify the addresses of objects. Consequently, the Message instances could (in general) still be "anywhere" in memory, so the locality of reference doesn't apply to the instances themselves. From this point of view, it looks like we're in the same situation as with a linked list: the instances themselves may reside "anywhere" in memory; the array only guarantees that the references to them are stored in a contiguous block...

...and here comes the use case: complete sequential traversal (iteration). First, let's examine how we get the references to the instances in each case. In the case of the array, it's very efficient, because they're in a contiguous block. But in the case of the linked list, we're also good, because once we've accessed a Message instance (that's why we iterate!), we immediately have the reference to the next instance. And since we've already accessed a field of the Message, accessing another field (the "next") should be served by the cache (the fields of the same object also have locality of reference, AFAIK; they're in a contiguous block, too). To sum up, it seems to break down to this:

  1. The array offers a cache-friendly iteration over references. Message instances themselves may be "anywhere" in memory, and we need to visit these, too.
  2. The linked list gives us the reference to the next element as soon as the current Message instance is accessed. This is "free", because each Message instance must be visited anyway (just like in the array case).

So, based on the above, it looks like the array should not be better than the linked list. The only exception is an array of primitive type (but in such a case, it wouldn't make sense to compare it with a linked list). So I'd expect them to perform similarly, but they didn't: there was a huge difference. In fact, if we assume that array indexing requires a range check each time an element is accessed, the linked list could (in theory) even be faster. (The range check of the array access is probably optimized away by the JIT, so I understand that this is not a valid point.)
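
To make the range-check remark concrete, a non-eliminated per-access check would conceptually amount to the following (a sketch of what the VM does, not real source; a JIT typically proves the bounds once and drops the check from the loop body):

for (int i = 0; i < cnt; i++) {
    // conceptual per-access range check, normally eliminated by the JIT
    if (i < 0 || i >= messages.length) {
        throw new ArrayIndexOutOfBoundsException(i);
    }
    Message msg = messages[i];
    if (msg.value1 > 564645) {
        val++;
    }
}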

My guesses are the following:

  1. It's probably not the array's cache-friendliness that is responsible for the 100% difference. Instead, the JIT performs optimizations that can't be done for the linked list traversal. If the range check and the (VM-level) null check are eliminated, then I guess the "array-get" bytecode instruction might be faster than my "field-get" (or whatever it's called) instruction in the linked list case (?).

  2. Even though the Message instances could be "anywhere" in memory, they are probably very close to each other, because they were allocated "at the same time". But 1,000,000 instances can't all be cached, only a part of them. In such a case, sequential access would be cache-friendly both in the array and in the linked list case, so this doesn't explain the difference. (A way to probe this assumption is sketched after this list.)

  3. Some intelligent "prediction" (prefetch) of the Message instance I'll access next? I.e. somehow there is still cache-friendliness with the Message instances themselves, but only in the case of array access.
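
One way to probe assumption 2, as a sketch (this control experiment is not part of the measurements above): shuffle the array so that the iteration order no longer matches the allocation order, then re-run the array test. If allocation-order locality matters, the shuffled run should be slower.

// Break the allocation-order locality while keeping the array contiguous.
// Arrays.asList returns a fixed-size List view backed by the array,
// so Collections.shuffle permutes the array in place.
java.util.Collections.shuffle(java.util.Arrays.asList(messages), r);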

UPDATE: Since several comments were received, I'd like to react to them below.

@irreputable:

the linked list is visited from high address to low address. what if it's the other way around, i.e. next points to a newer object, not a previous object

Very good spot! I didn't think of this little detail, namely that the layout may influence the test. I'll test it today and will return with the results. (Edit: the results are here; I've updated this post with "Section 2".)

@Torben comments:

Also I would say that this whole exercise seems pretty useless. You are talking about 4ms improvement over 100000 iterations. Seems like premature optimization. If you have a situation where this is a bottleneck, then please describe it and we can look into it (because it definitely would be a more interesting problem than this).

If it's not interesting to you, then you can disregard this topic (instead of posting 4 times). About your baseless assumption of "premature optimization" -- I'm afraid you read too much SO and perform too little industrial-level development. The concrete situation is in a simulation-related software which might have to traverse these lists several times per second. Indeed, 120+ ms latency may affect the responsiveness of an app.

I appreciate the thought you put into this, but I really can't find a question from your post. :) And array iteration is 50% faster. 100% faster would mean zero time consumed.

I'm sure it was pretty obvious from my post that I'm wondering why the very significant difference is present, when the arguments would imply otherwise. Thanks for the correction: indeed, I wanted to write that the linked list case is 100% slower.

irreputable had a very interesting observation for me:

the linked list is visited from high address to low address. what if it's the other way around, i.e. next points to a newer object, not a previous object

I changed the linked list structure such that the direction of its next pointers is equal to the instantiation order of its nodes:

Message current = null;
Message previous = new Message(r, null);
Message first = previous;
for (int i = 0; i < cnt; i++) {
    current = new Message(r, null);
    previous.next = current;
    previous = current;
}       
previous = current = null;

(Note that the creation algorithm might not be the most compact one; I think I know a slightly nicer way, sketched after the iteration code below.) The code that iterates through this linked list:

while (first != null) {
    if (first.value1 > 564645) {
        val++;
    }
    first = first.next;
}
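
For reference, the "slightly nicer way" could look like this (a sketch with a throwaway dummy head node; unlike the creation loop above, which also links the initial node and thus produces cnt + 1 elements, this links exactly cnt):

Message head = new Message(r, null);   // dummy node, discarded after construction
Message tail = head;
for (int i = 0; i < cnt; i++) {
    tail.next = new Message(r, null);  // append in instantiation order
    tail = tail.next;
}
Message first = head.next;             // the real first element
head = tail = null;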

And now the result I get is consistently 37-39 ms (OK, we could say it's exactly the array's performance, but actually it's slightly faster in every test run). Instead of the 80 ms of the "reversed-direction" linked list, it's twice as fast!

Then I made a similar test with the original array test case as well: I changed the array traversal to the opposite direction (to a count-down loop):

for (int i = cnt - 1; i >= 0; i--) {
    Message msg = messages[i];
    if (msg.value1 > 564645) {
        val++;
    }
}

And the result is consistently 85-90 ms! The original test case produced 40-41 ms.

It seems there are two new conclusions (and one question) now:

  1. The original claim seems to be true: the array's "locality of reference" (due to the contiguous memory block) doesn't provide an advantage in the case of "reference-type" (i.e. object) arrays when they're compared with linked lists. This is because such arrays only hold references to the object instances, not the object instances themselves (which can, in theory, be "anywhere" in memory, just like in the case of the linked list).

  2. In my test cases, the result seems to depend on the direction of traversal, even in the array scenario (!). How is this possible?

To sum up my test results:

  1. In "forward" direction traversal, the linked list slightly outperforms the array traversal (exactly as expected: we have the next reference immediately when a Message instance is obtained, i.e. even there is no need to access an array element for obtaining its address).

In "backward" direction traversal, both have about 100% weaker performance (and the linked list also slightly outperforms the array).

Any ideas?

UPDATE 1: dlthorpe made very valuable comments. I'll copy them here, as they might help in finding an answer to this "riddle".

Is there any indication that the hardware implements a look-ahead page prefetch in the memory cache controller? Instead of loading only the memory page needed for a memory reference, also load the next higher page in anticipation of a forward progressive read? This would eliminate page load waits for forward progressions through memory but would not eliminate page load waits for reverse progressions through memory.

[...]

I'd suggest testing on radically different hardware. Most mobile devices are running some form of ARM SoC. See if the test cases show similar skew on Intel hardware, like a PC or a Mac. If you can dig up an old PowerPC Mac, even better. If these don't show similar results, then that would point to something unique on the ARM platform or its Java implementation.

[...]

Right, your access patterns are mostly sequential, but in different directions. If something underneath you is doing prefetch but only in one direction (prefetch next higher address block), then that will skew the results in favor of the tests that run in that direction.

UPDATE 2: I ran the tests on a PC (Core i7 Nehalem architecture from Feb. 2009, 8 GB RAM, Windows 7). I used C#.NET in a .NET 2.0 source code project (but .NET 4 is installed on the machine). My results, with 25 million Message instances:

  • Linked list: 57-60 ms
  • Array: 60-63 ms

The direction of reading didn't seem to influence the results.

Answer

Talking about PC hardware, early hardware prefetchers (say circa 2005) were better at detecting and prefetching forward accesses, but more recent hardware should be good at detecting both directions. If you are interested in mobile hardware, it is totally possible that it still implements basic forward-only prefetching.

Outside of proper prefetch implemented in the MMU, which actually detects access patterns, it is very common for hardware to fetch more than one cache line when a cache miss occurs. Often this takes the form of simply fetching the next cache line, in addition to the required one, when a miss occurs. This implementation would give the forward direction a big advantage by effectively halving the cache miss rate in that case: for example, a forward scan then misses only on every other line, while a backward scan pays a miss on each new line (this assumes the dedicated prefetching is ineffective).

Locally, on a Core i7, I get slightly better results for the linked list version, at ~3.3 ms for the whole iteration vs. 3.5 ms for the array version, when using the original program (which iterates the linked list in reverse order of creation). So I don't see the same effect you did.

The inner loop of your test, the check on value1 that conditionally increments val, has a big impact. The current loop will cause a lot of mispredicts, unless the JIT compiler is smart enough to use CMOV or something similar. It seems that in my test it was, since I got about 1 ns/iteration for small iteration counts that fit in L1. 1 ns (about 3 cycles) isn't consistent with a full branch mispredict. When I changed it to do an unconditional val += msg.value1, the array version got a significant boost, even in the 1,000,000-iteration case (which probably won't even fit in L3).
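
For concreteness, the unconditional variant described above would look like this (a sketch in the style of the original snippets; array version first, then the linked list version):

// array version: accumulate unconditionally, so there is no branch to mispredict
for (int i = 0; i < cnt; i++) {
    val += messages[i].value1;
}

// linked list version of the same transformation
while (current != null) {
    val += current.value1;
    current = current.next;
}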

Interestingly enough, the same transformation (val += msg.value1) made the linked list version slightly slower. With the transformation, the array version was considerably faster at small iteration counts (inside L2; the two approaches were comparable outside it). From Caliper:

  length method         ns linear runtime
     100  ARRAY       63.7 =
     100 LINKED      190.1 =
    1000  ARRAY      725.7 =
    1000 LINKED     1788.5 =
 1000000  ARRAY  2904083.2 ===
 1000000 LINKED  3043820.4 ===
10000000  ARRAY 23160128.5 ==========================
10000000 LINKED 25748352.0 ==============================

The behavior at small iteration counts is easier to explain: the linked list, which has to use pointer chasing, has a data dependency between the iterations of the loop. That is, each iteration depends on the previous one, because the address to load comes from the previous element. The array doesn't have this data dependency: only the increment of i is dependent, and that is very fast (i is certainly in a register here). So the loop can be pipelined much better in the array case.
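
An annotated sketch of the two dependency patterns (illustration only, based on the loops above):

// linked list: a serial dependency chain; the address for the next load is
// the result of the previous load, so successive loads cannot overlap
while (current != null) {
    current = current.next;   // must complete before the next iteration starts
}

// array: load addresses depend only on i, not on previously loaded values,
// so the CPU can keep several independent loads in flight
for (int i = 0; i < cnt; i++) {
    Message msg = messages[i]; // address computed from i alone
}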
